A Bit of Information Theory
Unsupervised Learning Working Group
Assaf Oron, Oct. 15, 2003
Based mostly upon: Cover & Thomas, "Elements of Information Theory", 1991

Contents
– Coding and Transmitting Information
– Entropy etc.
– Information Theory and Statistics
– Information Theory and "Machine Learning"

What is Coding? (1)
– We keep coding all the time
– Crucial requirement for coding: "source" and "receiver" agree on the key
– Modern coding: telegraph -> radio -> …
  – Practical problems: how efficient can we make it? Tackled from the 1920s on
  – 1940s: Claude Shannon

What is Coding? (2)
– Shannon's greatness: finding a solution to the "specific" problem by working on the "general" problem
– Namely: how does one quantify information, its coding, and its transmission?
  – ANY type of information

Some Day-to-Day Codes

Code              | "Channel"                                | Unique? Instant?
Spoken language   | Sounds via air                           | Well…
Written language  | Signs on paper/screen                    | Well…
Numbers and math  | Signs on paper/screen, electronic, etc.  | Usually (decimal point, operation signs, etc.)
DNA protein code  | Nucleotide pairs                         | Yes (start, end, 3-somes)

Information Complexity of Some Coded Messages
– Let's think written numbers: k digits → 10^k possible messages
– How about written English?
  – k letters → 26^k possible messages
  – k words → D^k possible messages, where D is the English dictionary size
∴ Length ~ log(complexity)
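To make the length ~ log(complexity) relation concrete, here is a small Python sketch (mine, not from the slides): the number of symbols needed to distinguish N possible messages grows like log(N), with the base set by the alphabet size.

```python
import math

def symbols_needed(num_messages, alphabet_size):
    """Minimum number of symbols from a given alphabet needed to
    distinguish num_messages distinct messages: ceil(log_base(num_messages))."""
    return math.ceil(math.log(num_messages) / math.log(alphabet_size))

# 10^6 possible messages: 6 decimal digits, 20 bits, or 5 English letters.
for base, name in [(10, "decimal digits"), (2, "bits"), (26, "letters")]:
    print(f"{symbols_needed(10**6, base)} {name} cover one million messages")
```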

Information Entropy
The expected length (in bits) of a binary message conveying x-type information.
– Other common descriptions: "code complexity", "uncertainty", "missing/required information", "expected surprise", "information content" (BAD), etc.
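The formula behind this slide did not survive the transcript; the standard definition from Cover & Thomas is H(X) = -Σ p(x) log₂ p(x). A minimal Python sketch (function name and examples are mine):

```python
import math

def entropy(p):
    """Shannon entropy in bits: H(X) = -sum p(x) * log2 p(x), with 0*log(0) = 0."""
    return -sum(px * math.log2(px) for px in p if px > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))   # biased coin: ~0.47 bits (less "surprise" on average)
print(entropy([0.25] * 4))   # uniform over 4 outcomes: 2.0 bits
```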

Why "Entropy"?
– Thermodynamics (mid-19th century): "amount of unusable heat in a system"
– Statistical physics (late 19th century): "log(complexity of current system state)"
  – ⇉ amount of "mess" in the system
  – The two were proven to be equivalent
  – Statistical entropy is proportional to information entropy if p(x) is uniform
– 2nd Law of Thermodynamics…
  – Entropy never decreases (more later)

Entropy: Properties, Examples
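The worked examples on this slide were likely images; as a stand-in, a short sketch illustrating two standard properties, 0 ≤ H(X) ≤ log₂|X|, with the maximum attained exactly at the uniform distribution (assuming the usual definition of H):

```python
import math

def H(p):
    """Shannon entropy in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)

# Property 1: H >= 0, with H = 0 for a deterministic outcome.
print(H([1.0, 0.0, 0.0]))            # 0.0 bits -- no uncertainty at all

# Property 2: for an alphabet of size n, H <= log2(n),
# with equality exactly for the uniform distribution.
n = 8
print(H([1 / n] * n), math.log2(n))  # both 3.0 bits
print(H([0.5, 0.3, 0.1, 0.05, 0.03, 0.01, 0.005, 0.005]))  # ~1.86 bits < 3.0
```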

Kullback-Leibler Divergence ("Relative Entropy")
– In words: the expected excess message length incurred when messages that actually follow p(x) are encoded with a code optimized for q(x)
– Properties, relation to H:
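A minimal sketch of the standard definition, D(p‖q) = Σ p(x) log₂(p(x)/q(x)); the function name and example distributions are mine, not the slides':

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) * log2(p(x)/q(x)), in bits.
    The extra expected code length when data drawn from p are encoded
    with a code that is optimal for q (assumes q(x) > 0 wherever p(x) > 0)."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.25, 0.25]
q = [1 / 3, 1 / 3, 1 / 3]
print(kl_divergence(p, q))                          # > 0
print(kl_divergence(p, p))                          # 0.0: D(p||p) = 0
print(kl_divergence(p, q) == kl_divergence(q, p))   # False: not symmetric
```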

Mutual Information
– Relationship to D, H (hint: conditional probability):
– Properties, examples:
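The relations hinted at here are the standard ones: I(X;Y) = D(p(x,y) ‖ p(x)p(y)) = H(X) + H(Y) - H(X,Y). A sketch computing I from a joint probability table (names and numbers are illustrative):

```python
import math

def entropy(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), from a joint pmf table
    (rows indexed by x, columns by y)."""
    px = [sum(row) for row in joint]            # marginal of X
    py = [sum(col) for col in zip(*joint)]      # marginal of Y
    pxy = [v for row in joint for v in row]     # flattened joint
    return entropy(px) + entropy(py) - entropy(pxy)

# Dependent case: most mass on the diagonal.
joint = [[0.4, 0.1],
         [0.1, 0.4]]
print(mutual_information(joint))   # ~0.28 bits > 0

# Independent case: joint = product of marginals, so I(X;Y) = 0.
indep = [[0.25, 0.25],
         [0.25, 0.25]]
print(mutual_information(indep))   # 0.0 bits
```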

Entropy for Continuous RVs
– "Little" h, defined in the "natural" way (the sum replaced by an integral)
– However, it is not the same measure: applying the discrete H to a continuous RV gives infinity, and the differential h of a discrete RV degenerates to -∞ (measure theory…)
– For many continuous distributions (location-scale families), h is ½·log(variance) plus some constant
  – Why?
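For the Gaussian case the constant is explicit: h = ½ log₂(2πeσ²) bits, so doubling σ adds exactly one bit. A hedged sketch (not from the slides) that checks this against a Monte Carlo estimate of E[-log₂ f(X)]:

```python
import math, random

def gaussian_h(sigma):
    """Differential entropy of N(mu, sigma^2) in bits: 0.5 * log2(2*pi*e*sigma^2)."""
    return 0.5 * math.log2(2 * math.pi * math.e * sigma ** 2)

def mc_estimate(sigma, n=100_000):
    """Monte Carlo estimate of h = E[-log2 f(X)] from samples of N(0, sigma^2)."""
    def log2_pdf(x):
        return (-0.5 * math.log2(2 * math.pi * sigma ** 2)
                - (x * x) / (2 * sigma ** 2) * math.log2(math.e))
    return -sum(log2_pdf(random.gauss(0.0, sigma)) for _ in range(n)) / n

for sigma in (1.0, 2.0):
    print(gaussian_h(sigma), mc_estimate(sigma))
# Doubling sigma adds exactly log2(2) = 1 bit: h depends on the variance
# only through (1/2) * log2(variance), plus a family-specific constant.
```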

The Statistical Connection (1)
– K-L divergence ⇔ likelihood ratio
– The law of large numbers can be rephrased as a limit on D
– For distributions with the same variance, the normal is the one with maximum h
  – (2nd law of thermodynamics revisited)
  – h is an average quantity. Is the CLT, then, a "law of nature"?… (I think: "YES"!)
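The K-L ⇔ likelihood-ratio link can be seen numerically: for i.i.d. draws from p, the average per-observation log-likelihood ratio against a model q converges to D(p‖q) by the law of large numbers. A small sketch with illustrative distributions of my choosing:

```python
import math, random

p = [0.6, 0.3, 0.1]   # true source distribution
q = [1/3, 1/3, 1/3]   # model used in the likelihood ratio

# D(p || q) in bits.
D = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

# Draw n i.i.d. symbols from p; the per-symbol log-likelihood ratio
# (1/n) * sum log2(p(x_i) / q(x_i)) converges to D by the law of large numbers.
n = 200_000
samples = random.choices(range(3), weights=p, k=n)
llr = sum(math.log2(p[x] / q[x]) for x in samples) / n

print(D, llr)   # the two numbers should be close
```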

The Statistical Connection (2)
– Mutual information is very useful
  – Certainly for discrete RVs
  – Also for continuous ones (no distributional assumptions!)
– A lot of implications for stochastic processes as well
  – I just don't quite understand them
  – English?

Machine Learning? (1)
– So far, we haven't mentioned noise
  – In information theory, noise exists in the channel
  – Channel capacity: max(mutual information) between "source" and "receiver"
  – Noise directly decreases the capacity
– Shannon's "biggest" result: this capacity can be (almost) achieved with (almost) zero error
  – Known as the "Channel Coding Theorem"
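As one concrete instance (not on the slide), the binary symmetric channel with crossover probability p has capacity C = 1 - H₂(p), obtained by maximizing the mutual information over input distributions:

```python
import math

def h2(p):
    """Binary entropy function, in bits."""
    return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def bsc_capacity(crossover):
    """Capacity of a binary symmetric channel: C = 1 - H2(p),
    the maximum of I(source; receiver) over input distributions."""
    return 1.0 - h2(crossover)

for p in (0.0, 0.01, 0.11, 0.5):
    print(f"crossover {p:4.2f} -> capacity {bsc_capacity(p):.3f} bits/use")
# Noise eats capacity directly: at p = 0.5 the channel carries nothing.
```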

Machine Learning? (2)
– The CCT inspired practical developments
  – Now it all depends on code and channel!
  – Smarter, "error-correcting" codes
  – Tech developments focus on channel capacity

Machine Learning? (3)
Can you find an analogy between coding and classification/clustering? (Can it be useful??)

Coding               | M. Learning
Source entropy       | Variability of interest
Choice of channel    | Parameterization
Choice of code       | Classification rules
Channel noise        | "Noise", random errors
Channel capacity     | Maximum accuracy
I(source, receiver)  | Actual accuracy
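To make the last row of the analogy concrete, here is a hedged sketch (the confusion-matrix numbers are hypothetical) that scores a classifier by I(true label; predicted label), computed exactly like the mutual information above:

```python
import math

def entropy(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

def label_information(confusion, n):
    """I(true label; predicted label), in bits, from a confusion matrix
    (rows = true class, columns = predicted class, entries = counts)."""
    joint = [[c / n for c in row] for row in confusion]
    p_true = [sum(row) for row in joint]
    p_pred = [sum(col) for col in zip(*joint)]
    p_joint = [v for row in joint for v in row]
    return entropy(p_true) + entropy(p_pred) - entropy(p_joint)

# Hypothetical 2-class classifiers evaluated on 100 balanced cases.
good = [[45, 5], [5, 45]]        # 90% accuracy
guessing = [[25, 25], [25, 25]]  # 50% accuracy, pure chance
print(label_information(good, 100))      # ~0.53 of the possible 1 bit
print(label_information(guessing, 100))  # 0.0 bits: no information transmitted
```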

Machine Learning? (4)
Information theory tells us that:
– We CAN find a nearly optimal classification or clustering rule ("coding")
– We CAN find a nearly optimal parameterization + classification combo
– Perhaps the newer wave of successful but statistically "intractable" methods (boosting, etc.) works by increasing channel capacity (i.e., high-dimensional parameterization)?