Information Theory. Ying Nian Wu, UCLA Department of Statistics. July 9, 2007, IPAM Summer School.

Goal: a gentle introduction to the basic concepts of information theory. Emphasis: understanding and interpretation of these concepts. Reference: Elements of Information Theory by Cover and Thomas.

Topics: entropy and relative entropy, the asymptotic equipartition property, data compression, large deviation, Kolmogorov complexity, and the entropy rate of a process.

Entropy measures the randomness or uncertainty of a probability distribution.

Definition. The entropy of a discrete random variable X with probability mass function p(x) is H(X) = -∑ p(x) log p(x) = E[-log p(X)].

Example: a uniform distribution on K outcomes has entropy log K, while a point mass has entropy 0.

Definition for both discrete and continuous distributions: H(X) = E[-log p(X)]. Recall that E[g(X)] = ∑ g(x) p(x) in the discrete case and ∫ g(x) p(x) dx in the continuous case, so entropy is the expected value of -log p(X) (differential entropy for continuous distributions).

Example: a Bernoulli(p) distribution has entropy H(p) = -p log p - (1-p) log(1-p), which is maximized at p = 1/2 and equals 0 at p = 0 or 1.
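As a quick sketch (not from the slides; the probability vectors are assumed for illustration), the entropy of a discrete distribution can be computed directly in Python:

import math

def entropy(p, base=2):
    # H = -sum p_i log p_i, with the convention 0 log 0 = 0
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform on 4 outcomes: 2.0 bits
print(entropy([0.5, 0.5]))                # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))                # biased coin: about 0.47 bits
print(entropy([1.0]))                     # point mass: 0.0 bits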

Interpretation 1: cardinality. For a uniform distribution over a finite set A, there are |A| elements in A and all these choices are equally likely, so H = log |A|. Entropy can be interpreted as the log of a volume or size; for example, a d-dimensional cube has 2^d vertices, so the log of a count can also be interpreted as a dimensionality. What if the distribution is not uniform?

Asymptotic equipartition property: any distribution is essentially a uniform distribution under long-run independent repetition. Let X1, ..., Xn ~ p(x) independently, so p(X1, ..., Xn) = p(X1) ... p(Xn). Recall that if the distribution is uniform over a set A, then p(X1, ..., Xn) = 1/|A|^n, a constant. For a general p, p(X1, ..., Xn) is random, but in some sense it is essentially a constant!

Law of large numbers: for X1, ..., Xn drawn independently from p(x), the long-run average (1/n) ∑ g(Xi) converges to the expectation E[g(X)].

Asymptotic equipartition property. Intuitively, in the long run, -(1/n) log p(X1, ..., Xn) = -(1/n) ∑ log p(Xi) → E[-log p(X)] = H.

Recall that if the distribution is uniform over a set A, then p(X1, ..., Xn) = 1/|A|^n, a constant. Therefore, for large n it is as if (X1, ..., Xn) were uniform over a set of about 2^(nH) typical sequences, with p(x1, ..., xn) ≈ 2^(-nH). So the dimensionality per observation is H. This is the asymptotic equipartition property; we can make it more rigorous.
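A minimal simulation sketch (assumed parameters, not from the slides): draw an i.i.d. sequence from a biased coin and check that -(1/n) log2 p(X1, ..., Xn) concentrates near H.

import math, random

p = 0.8                                         # assumed P(X = 1) for a biased coin
H = -(p*math.log2(p) + (1-p)*math.log2(1-p))    # about 0.72 bits
n = 100000
random.seed(0)
xs = [1 if random.random() < p else 0 for _ in range(n)]
log_prob = sum(math.log2(p) if x == 1 else math.log2(1-p) for x in xs)
print(H)                # entropy of one flip
print(-log_prob / n)    # close to H, as the AEP predicts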

Weak law of large numbers: for X1, ..., Xn drawn independently from p(x), -(1/n) log p(X1, ..., Xn) → H(X) in probability; that is, for any ε > 0, P(| -(1/n) log p(X1, ..., Xn) - H | > ε) → 0 as n → ∞.

Typical set: A_ε^(n) is the set of sequences (x1, ..., xn) with 2^(-n(H+ε)) ≤ p(x1, ..., xn) ≤ 2^(-n(H-ε)).

Properties of the typical set: P(A_ε^(n)) > 1 - ε for sufficiently large n, and the number of sequences in A_ε^(n) is at most 2^(n(H+ε)) (and at least (1-ε) 2^(n(H-ε)) for large n).
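A small enumeration sketch under assumed parameters (Bernoulli(0.8) source, n = 20, ε = 0.1): count the typical sequences by the number of ones and check the size bound and total probability.

import math

p, n, eps = 0.8, 20, 0.1                        # assumed source and tolerance
H = -(p*math.log2(p) + (1-p)*math.log2(1-p))
count, prob = 0, 0.0
for k in range(n + 1):                          # k = number of ones in the sequence
    seq_logp = k*math.log2(p) + (n-k)*math.log2(1-p)   # log2 probability of one such sequence
    if abs(-seq_logp/n - H) <= eps:             # the sequence is typical
        num = math.comb(n, k)
        count += num
        prob += num * 2.0**seq_logp
print(count, 2**(n*(H+eps)))    # typical set size vs. the upper bound 2^(n(H+eps))
print(prob)                     # P(typical set); approaches 1 as n grows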

Interpretation 2: coin flipping. Flip a fair coin once: {Head, Tail}. Flip a fair coin twice independently: {HH, HT, TH, TT}. ... Flip a fair coin n times independently: 2^n equally likely sequences. We may interpret entropy as the number of fair coin flips.

Interpretation 2: coin flipping. Example: the uniform distribution over four equally likely outcomes above amounts to 2 coin flips, matching its entropy of 2 bits.

Interpretation 2: coin flipping. A typical sequence (x1, ..., xn), with p(x1, ..., xn) ≈ 2^(-nH), amounts to nH coin flips, so each observation amounts to H flips.

Interpretation 3: coding.

Interpretation 3: coding. With about 2^(nH) typical sequences, how many bits do we need to code the elements in the typical set? About nH bits, i.e., H bits per observation. This can be made more formal using the typical set.

Prefix code: assign each symbol a binary codeword such that no codeword is a prefix of another; then a string such as abacbd maps to a concatenated bit sequence that can be decoded unambiguously.
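A toy sketch with an assumed prefix code (a: 0, b: 10, c: 110, d: 111; not necessarily the code on the slide), encoding abacbd and decoding it back:

code = {"a": "0", "b": "10", "c": "110", "d": "111"}    # assumed prefix code
def encode(s):
    return "".join(code[ch] for ch in s)
def decode(bits):
    inv, out, buf = {v: k for k, v in code.items()}, [], ""
    for b in bits:
        buf += b
        if buf in inv:             # no codeword is a prefix of another,
            out.append(inv[buf])   # so the first match is the intended symbol
            buf = ""
    return "".join(out)
msg = "abacbd"
bits = encode(msg)
print(bits)                 # 010011010111
print(decode(bits) == msg)  # True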

The coded version of abacbd looks like a sequence of coin flips: a completely random bit sequence that cannot be further compressed. An optimal code gives a word i of probability p_i a codeword of about log(1/p_i) bits, so more probable words get shorter codewords.

Optimal code. The Kraft inequality for a prefix code with codeword lengths l_i is ∑ 2^(-l_i) ≤ 1. Minimizing the expected length ∑ p_i l_i subject to the Kraft inequality gives the optimal length l_i = log(1/p_i), so the expected length equals the entropy.
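A sketch with an assumed distribution: Shannon code lengths l_i = ceil(log2(1/p_i)) satisfy the Kraft inequality and give an expected length within one bit of the entropy (exactly the entropy here, since the probabilities are powers of 1/2).

import math

p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}        # assumed probabilities
lengths = {x: math.ceil(-math.log2(px)) for x, px in p.items()}
kraft = sum(2**(-l) for l in lengths.values())
expected_len = sum(p[x]*lengths[x] for x in p)
H = -sum(px*math.log2(px) for px in p.values())
print(lengths)               # {'a': 1, 'b': 2, 'c': 3, 'd': 3}
print(kraft <= 1)            # True: the Kraft inequality holds
print(expected_len, H)       # 1.75 and 1.75 bits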

Wrong model. If the data follow p but we use the optimal code for a wrong model q, the expected code length becomes ∑ p(x) log(1/q(x)) instead of H(p); the extra cost, the redundancy, is the relative entropy D(p||q). Box: all models are wrong, but some are useful.

Relative entropy (Kullback-Leibler divergence): D(p||q) = ∑ p(x) log(p(x)/q(x)).
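A numeric sketch with assumed p and q, checking the identity behind the coding redundancy above: cross-entropy = entropy + relative entropy.

import math

def H(p):
    return -sum(pi*math.log2(pi) for pi in p if pi > 0)
def cross_entropy(p, q):
    return -sum(pi*math.log2(qi) for pi, qi in zip(p, q) if pi > 0)
def kl(p, q):
    return sum(pi*math.log2(pi/qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]      # assumed true distribution
q = [1/3, 1/3, 1/3]        # assumed wrong model
print(kl(p, q))                                  # redundancy, about 0.085 bits
print(cross_entropy(p, q), H(p) + kl(p, q))      # equal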

Relative entropy is non-negative: D(p||q) ≥ 0, with equality if and only if p = q, which follows from the Jensen inequality applied to the convex function -log.

Types. For X1, ..., Xn drawn independently from p(x), let n(x) be the number of times the value x appears in the sample; the normalized frequencies q(x) = n(x)/n form the type (empirical distribution) of the sequence.

Law of large numbers: the type q converges to p. Refinement: the probability of observing a type close to a particular distribution q behaves like 2^(-n D(q||p)).

Law of large numbers, refinement: large deviation. The probability that the empirical distribution falls in a set of distributions away from p decays exponentially, with rate given by the smallest D(q||p) over that set.
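A Monte Carlo sketch under assumed numbers: for a fair coin, the chance that the empirical frequency of heads reaches 0.7 decays roughly like 2^(-n D), where D = D(Bernoulli(0.7) || Bernoulli(0.5)).

import math, random

def kl_bern(a, b):
    return a*math.log2(a/b) + (1-a)*math.log2((1-a)/(1-b))

p, thresh, n, trials = 0.5, 0.7, 50, 100000      # assumed parameters
random.seed(0)
hits = sum(1 for _ in range(trials)
           if sum(random.random() < p for _ in range(n)) >= thresh*n)
print(hits/trials)                  # empirical probability of the rare event
print(2**(-n*kl_bern(thresh, p)))   # 2^(-nD): captures the exponential decay
                                    # (up to polynomial factors in n)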

Kolmogorov complexity. Example: the string 011011...011 of length n. Program: for (i = 1 to n/3) write(011) end. This can be translated into binary machine code. The Kolmogorov complexity of a string is the length of the shortest machine code that reproduces it; no probability distribution is involved. If a long sequence is not compressible, then it has all the statistical properties of a sequence of coin flips: string = f(coin flips).
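An informal sketch of the compressibility point, using a general-purpose compressor as a crude stand-in for the shortest program (the strings below are assumed for illustration):

import zlib, random

n = 30000
regular = ("011" * (n // 3)).encode()            # the repetitive string above
random.seed(0)
coin = "".join(random.choice("01") for _ in range(n)).encode()   # fair coin flips as characters
print(len(zlib.compress(regular, 9)))   # tiny: the string has a short description
print(len(zlib.compress(coin, 9)))      # much larger: needs about 1 bit per flip,
                                        # and cannot be squeezed much below that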

Joint and conditional entropy. Joint distribution p(x, y); marginal distribution p(x) = ∑ p(x, y), summed over y; e.g., eye color and hair color.

Joint and conditional entropy. Conditional distribution p(y|x) = p(x, y)/p(x); chain rule p(x, y) = p(x) p(y|x).

Joint and conditional entropy. Joint entropy H(X, Y) = -∑ p(x, y) log p(x, y); conditional entropy H(Y|X) = -∑ p(x, y) log p(y|x).

Chain rule for entropy: H(X, Y) = H(X) + H(Y|X).
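A sketch with an assumed 2-by-2 joint table (eye color by hair color), verifying the chain rule numerically:

import math

joint = {("blue", "blond"): 0.3, ("blue", "dark"): 0.1,
         ("brown", "blond"): 0.2, ("brown", "dark"): 0.4}    # assumed p(x, y)
def H(ps):
    return -sum(p*math.log2(p) for p in ps if p > 0)
px = {}
for (x, _), pxy in joint.items():
    px[x] = px.get(x, 0.0) + pxy                 # marginal p(x)
H_xy = H(joint.values())
H_x = H(px.values())
H_y_given_x = -sum(pxy*math.log2(pxy/px[x]) for (x, _), pxy in joint.items())
print(H_xy, H_x + H_y_given_x)                   # equal: H(X,Y) = H(X) + H(Y|X)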

Mutual information: I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X, Y), which equals the relative entropy D(p(x, y) || p(x) p(y)).
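Continuing with the same assumed table, mutual information computed as the relative entropy between the joint distribution and the product of the marginals:

import math

joint = {("blue", "blond"): 0.3, ("blue", "dark"): 0.1,
         ("brown", "blond"): 0.2, ("brown", "dark"): 0.4}    # assumed p(x, y)
px, py = {}, {}
for (x, y), pxy in joint.items():
    px[x] = px.get(x, 0.0) + pxy
    py[y] = py.get(y, 0.0) + pxy
I = sum(pxy*math.log2(pxy/(px[x]*py[y])) for (x, y), pxy in joint.items())
print(I)    # about 0.12 bits; it would be 0 if X and Y were independent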

Entropy rate. A stochastic process (X1, X2, ...) need not be independent. Markov chain: p(x_{n+1} | x_n, ..., x_1) = p(x_{n+1} | x_n). Entropy rate: the limit of (1/n) H(X1, ..., Xn). For a stationary process this limit exists and equals the limit of H(X_n | X_{n-1}, ..., X_1). For a stationary Markov chain the entropy rate is H(X2 | X1) = -∑ μ_i P_ij log P_ij, where μ is the stationary distribution; this is the number of bits per symbol needed for compression.
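A sketch for an assumed two-state Markov chain, computing the entropy rate from the stationary distribution and the transition matrix:

import math

P = [[0.9, 0.1],
     [0.3, 0.7]]        # assumed transition matrix: P[i][j] = P(next = j | current = i)
# stationary distribution of a two-state chain
mu = [P[1][0] / (P[0][1] + P[1][0]), P[0][1] / (P[0][1] + P[1][0])]
rate = -sum(mu[i]*P[i][j]*math.log2(P[i][j]) for i in range(2) for j in range(2))
print(mu)       # [0.75, 0.25]
print(rate)     # about 0.57 bits per symbol, less than 1 bit for i.i.d. fair flips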

Shannon's approximations of English:
1. Zero-order approximation: XFOML RXKHRJFFJUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD.
2. First-order approximation: OCRO HLI RGWR NMIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH BRL.
3. Second-order approximation (digram structure as in English): ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN D ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE.
4. Third-order approximation (trigram structure as in English): IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID PONDENOME OF DEMONSTURES OF THE REPTAGIN IS REGOACTIONA OF CRE.
5. First-order word approximation: REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE.
6. Second-order word approximation: THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED.
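A rough sketch of how such approximations can be generated: estimate letter-digram frequencies from training text and sample a chain from them. The tiny training string here is an assumed placeholder, not Shannon's corpus.

import random
from collections import defaultdict

text = "the head and in frontal attack on an english writer "   # assumed tiny corpus
nexts = defaultdict(list)
for a, b in zip(text, text[1:]):
    nexts[a].append(b)                 # empirical digram table: successors of each letter
random.seed(0)
out, ch = [], "t"
for _ in range(60):
    out.append(ch)
    ch = random.choice(nexts[ch])      # sample the next letter given the current one
print("".join(out))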

Summary. The entropy of a distribution measures its randomness or uncertainty: the log of the number of equally likely choices, the average number of coin flips, and the average length of a prefix code (Kolmogorov: the shortest machine code, a probability-free notion of randomness). The relative entropy from one distribution to another measures the departure from the first to the second: coding redundancy and large deviation. Related notions: conditional entropy, mutual information, and entropy rate.