Basic Statistics and Shannon Entropy Ka-Lok Ng Asia University

Mean and Standard Deviation (SD) When comparing distributions having the same mean value: a small SD value indicates a narrow distribution, while a large SD value indicates a wide distribution.

Pearson correlation coefficient or Covariance Statistics: standard deviation and variance, var(X) = s^2, describe 1-dimensional data. What about higher-dimensional data? It is useful to have a similar measure of how much the dimensions vary from the mean with respect to each other. Covariance is measured between 2 dimensions; for a 3-dimensional data set (X, Y, Z) one can calculate Cov(X,Y), Cov(X,Z) and Cov(Y,Z). To compare heterogeneous pairs of variables, define the correlation coefficient (Pearson correlation coefficient), rho_XY = Cov(X,Y)/(s_X s_Y), with -1 <= rho_XY <= +1: -1 means perfect anticorrelation, 0 means uncorrelated, +1 means perfect correlation.
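A minimal sketch of these two quantities in Python with NumPy (the slides themselves use Excel, so the language choice and the example arrays x and y are assumptions, not material from the deck):

```python
import numpy as np

# Hypothetical example data (not from the slides)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Covariance of the two variables, taken from the 2 x 2 covariance matrix
cov_xy = np.cov(x, y)[0, 1]

# Pearson correlation coefficient: Cov(X, Y) / (s_X * s_Y)
rho_xy = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

print(f"Cov(X,Y) = {cov_xy:.3f}")
print(f"rho_XY   = {rho_xy:.3f}")   # same value as np.corrcoef(x, y)[0, 1]
```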

The squared Pearson correlation coefficient The Pearson correlation coefficient is useful for examining correlations in the data. One may imagine an instance, for example, in which an event can cause both enhancement and repression. A better alternative in such cases is the squared Pearson correlation coefficient (PCC), rho_sq = rho^2. The squared PCC takes values in the range 0 <= rho_sq <= 1: 0 means uncorrelated vectors, 1 means perfectly correlated or anti-correlated. PCCs are measures of similarity. Similarity and distance have a reciprocal relationship (the higher the similarity, the smaller the distance), so d = 1 - rho is typically used as a measure of distance.

Pearson correlation coefficient or Covariance The resulting rho_XY value will be larger than 0 if X and Y tend to increase together, below 0 if they tend to decrease together, and 0 if they are independent. Remark: rho_XY only tests for a linear dependence, Y = aX + b. If two variables are independent, rho_XY is low; but a low rho_XY may or may not imply independence, since the relation may be non-linear (for example, y = sin x). A high rho_XY is a sufficient but not a necessary condition for variable dependence.

Pearson correlation coefficient To test for a non-linear relation among the data, one can transform the data by a substitution of variables. Suppose one wants to test the relation u(v) = a v^n. Taking the logarithm of both sides gives log u = log a + n log v. Setting Y = log u, b = log a, and X = log v yields a linear relation, Y = b + nX, so log u correlates (n > 0) or anti-correlates (n < 0) with log v.
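A small illustration of this variable substitution (the data are made up, and Python is assumed rather than taken from the slides): a power-law relation u = a v^n that is only imperfectly linear in the raw variables becomes almost perfectly linear after taking logarithms.

```python
import numpy as np

# Hypothetical power-law data u = a * v**n with a little multiplicative noise
rng = np.random.default_rng(0)
v = np.linspace(1.0, 10.0, 50)
u = 2.0 * v**3 * rng.lognormal(sigma=0.05, size=v.size)

# Correlation on the raw values vs. on the log-transformed values
rho_raw = np.corrcoef(v, u)[0, 1]
rho_log = np.corrcoef(np.log(v), np.log(u))[0, 1]

print(f"rho(v, u)         = {rho_raw:.3f}")   # weaker linear correlation
print(f"rho(log v, log u) = {rho_log:.3f}")   # essentially 1 for a power law
```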

Pearson correlation coefficient or Covariance matrix A covariance matrix is merely a collection of many covariances in the form of a d x d matrix: its (i, j) entry is Cov(X_i, X_j), and the diagonal entries are the variances var(X_i).
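A brief sketch of how such a matrix can be computed (Python/NumPy is an assumption; the 3-dimensional data set is made up):

```python
import numpy as np

# Hypothetical 3-dimensional data set (X, Y, Z): 100 observations, made-up values
rng = np.random.default_rng(1)
data = rng.normal(size=(100, 3))      # rows = observations, columns = X, Y, Z

# 3 x 3 covariance matrix: entry (i, j) is the covariance of column i with column j
cov_matrix = np.cov(data, rowvar=False)

print(cov_matrix)                     # diagonal entries are the variances
```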

Spearman’s rank correlation One of the problems with using the PCC is that it is susceptible to being skewed by outliers: a single data point can make two variables appear to be correlated, even when all the other data points suggest that they are not. Spearman’s rank correlation (SRC) is a non-parametric measure of correlation that is robust to outliers; it ignores the magnitude of the changes. The idea of rank correlation is to transform the original values into ranks, and then to compute the correlation between the series of ranks. First we order the values of gene A and gene B in ascending order and assign the lowest value rank 1. The SRC between A and B is defined as the PCC between the ranked A and the ranked B. In case of ties, assign midranks: for example, if two values would both occupy ranks 5 and 6, assign each a rank of 5.5.

Spearman’s rank correlation The SRC can be calculated as the Pearson correlation of the ranks, where x_i and y_i denote the rank of the i-th x and y values respectively. An approximate formula in case of ties is r_s = 1 - 6 * sum_i d_i^2 / (n (n^2 - 1)), where d_i = x_i - y_i is the rank difference and n is the number of data points.
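A minimal Python sketch of the definition (Python and SciPy are assumptions, not part of the slides; the gene values are invented, with one outlier to show the robustness):

```python
import numpy as np
from scipy.stats import rankdata

def spearman_rank_correlation(x, y):
    """Spearman's rank correlation: the Pearson correlation of the ranks
    (rankdata assigns midranks to tied values by default)."""
    rx = rankdata(x)
    ry = rankdata(y)
    return np.corrcoef(rx, ry)[0, 1]

# Hypothetical expression values for two genes, with one outlier in gene B
gene_a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
gene_b = np.array([1.1, 2.2, 2.9, 4.3, 50.0])

print(f"Pearson  = {np.corrcoef(gene_a, gene_b)[0, 1]:.3f}")   # pulled by the outlier
print(f"Spearman = {spearman_rank_correlation(gene_a, gene_b):.3f}")   # rank order is perfect
```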

Distances in discretized space Sometimes one has to deal with discretized values. The similarity between two discretized vectors can be measured using the notion of Shannon entropy.

Entropy and the Second Law of Thermodynamics: Disorder and the Unavailability of Energy When ice melts, it becomes more disordered and less structured. Entropy always increases.

Statistical Interpretation of Entropy and the Second Law S = k ln W, where S = entropy, k = Boltzmann constant, and ln W = natural logarithm of the number of microstates W corresponding to the given macrostate. L. Boltzmann (1844-1906).

Entropy and the Second Law of Thermodynamics: Disorder and the Unavailability of Energy (figure slide)

Concept of entropy Toss 5 coins; the possible outcomes and their numbers of microstates are 5H0T: 1, 4H1T: 5, 3H2T: 10, 2H3T: 10, 1H4T: 5, 0H5T: 1, for a total of 32 microstates. The 3H2T and 2H3T outcomes are the most probable macrostates. Propose that entropy is related to the number of microstates W, i.e. S ~ W. (Coin tosses can be generated with Excel.)
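The slide suggests generating the coin tosses with Excel; the following is an equivalent sketch in Python (an assumption, not the author's method), which simulates many 5-coin tosses and compares the observed macrostate frequencies with the exact 1:5:10:10:5:1 microstate counts out of 2^5 = 32.

```python
from math import comb
from collections import Counter
import numpy as np

rng = np.random.default_rng(42)
n_trials = 100_000

# Each row is one toss of 5 fair coins; sum the 0/1 outcomes to count heads
heads = rng.integers(0, 2, size=(n_trials, 5)).sum(axis=1)
counts = Counter(heads.tolist())

for k in range(5, -1, -1):
    print(f"{k}H{5-k}T: observed = {counts[k] / n_trials:.4f}, "
          f"exact = {comb(5, k) / 32:.4f}")
```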

Shannon entropy Shannon entropy is related to physical entropy. Shannon asked the question “What is information?” Energy is defined as the capacity to do work, not the work itself; work is a form of energy. Similarly, define information as the capacity to store and transmit meaning or knowledge, not the meaning or knowledge itself. For example, the WWW provides a lot of information, but that does not mean knowledge. Shannon suggested that entropy is the measure of this capacity. Summary: information is the capacity to store knowledge, entropy is the measure of that capacity, and this measure is the Shannon entropy. Entropy ~ randomness ~ measure of the capacity to store and transmit knowledge. Reference: Gatlin L.L., Information Theory and the Living System, Columbia University Press, New York, 1972.

Shannon entropy How do we relate randomness to a measure of this capacity? Consider the coin-toss microstates again (5H0T: 1, 4H1T: 5, 3H2T: 10, 2H3T: 10, 1H4T: 5, 0H5T: 1). Physical entropy is S = k ln W. Assuming equal probability of each individual microstate, p_i = 1/W, so S = -k ln p_i. Information ~ 1/p_i = W. If p_i = 1 there is no information, because it means certainty; if p_i << 1 there is more information. That is, information is a decrease in uncertainty.

Distances in discretized space Sometimes it is advantageous to use a discretized expression matrix as the starting point, e.g. to assign the values 0 (expression unchanged), 1 (expression increased) and -1 (expression decreased). The similarity between two discretized vectors can be measured by the notion of Shannon entropy. Shannon entropy: H_1 = -sum_i p_i log_2 p_i, where p_i is the probability of observing a particular symbol or event i within a given sequence; the logarithm is taken to base 2, so H_1 is measured in bits. Consider a binary system: an element X has two states, 0 or 1. H_1 measures the “uncertainty” of a probability distribution and is the expectation (average) value of the information -log_2 p_i. Claude Shannon is the father of information theory. Reference: plus.maths.org/issue23/features/data/
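A minimal Python sketch of this definition (the language and the helper function name are assumptions):

```python
import numpy as np

def shannon_entropy(probs):
    """H = -sum(p_i * log2(p_i)) in bits; terms with p_i = 0 contribute nothing."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Binary system with two equally likely states: H = 1 bit
print(shannon_entropy([0.5, 0.5]))   # 1.0

# A certain outcome carries no information: H = 0 bits
print(shannon_entropy([1.0, 0.0]))   # 0.0
```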

Shannon Entropy Examples: for p_i = (1, 0), H_1 = -1*1*[log_2(1)] = 0 (a certain outcome, no information); for p_i = (1/2, 1/2), H_1 = -2*(1/2)*[log_2(1/2)] = 1 bit; for 4 equally likely states, H_1 = -4*(1/4)*[log_2(1/4)] = 2 bits; for 2^N equally likely states with p_i = 1/2^N, H_1 = -2^N*(1/2^N)*[log_2(1/2^N)] = N bits. A uniform probability distribution gives the maximal value of H_1. DNA sequence: n = 4 states, so the maximum is H_1 = -4*(1/4)*log_2(1/4) = 2 bits. Protein sequence: n = 20 states, so the maximum is H_1 = -20*(1/20)*log_2(1/20) = log_2(20) ≈ 4.32 bits, which is between 4 and 5 bits.
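A quick check of these maximum values (Python assumed; for n equally likely states the entropy reduces to log_2 n):

```python
import numpy as np

def max_entropy_bits(n_states):
    """Maximum Shannon entropy of n equally likely states: log2(n) bits."""
    return np.log2(n_states)

print(f"DNA (n = 4):      {max_entropy_bits(4):.2f} bits")    # 2.00
print(f"Protein (n = 20): {max_entropy_bits(20):.2f} bits")   # ~4.32
```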

The divergence from equi-probability When all letters are equi-probable, p_i = 1/n and H_1 = log_2(n), the maximum value H_1 can take. Define H_1^max = log_2(n) and the divergence from this equi-probable state, D_1 = H_1^max - H_1. D_1 tells us how much of the total divergence from the maximum entropy state is due to the divergence of the base composition from a uniform distribution. For example, the E. coli genome has no divergence from equi-probability because H_1 = 2 bits, but for the M. lysodeikticus genome H_1 = 1.87 bits, so D_1 = 2.00 - 1.87 = 0.13 bit. Divergence from independence: single-letter events contain no information about how these letters are arranged in a linear sequence.
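A short sketch of computing D_1 from a base composition (Python assumed; the skewed composition below is an illustrative made-up value, not measured data):

```python
import numpy as np

def h1_bits(base_probs):
    """Single-letter Shannon entropy H1 (in bits) of a base composition."""
    p = np.asarray(base_probs, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def d1_bits(base_probs, n_states=4):
    """Divergence from equi-probability: D1 = log2(n) - H1."""
    return np.log2(n_states) - h1_bits(base_probs)

# Equi-probable composition (E. coli-like): D1 = 0
print(f"D1 (uniform)            = {d1_bits([0.25, 0.25, 0.25, 0.25]):.2f} bit")

# A GC-rich composition (illustrative values only)
print(f"D1 (skewed composition) = {d1_bits([0.15, 0.15, 0.35, 0.35]):.2f} bit")
```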

Divergence from independence – Conditional Entropy Question: does the occurrence of any one base along a DNA sequence alter the probability of occurrence of the base next to it? What are the numerical values of the conditional probabilities? Here p(X|Y) = probability of event X conditioned on event Y, e.g. p(A|A), p(T|A), p(C|A), p(G|A), etc. If the bases were independent, p(A|A) = p(A), p(T|A) = p(T), and so on. Extreme ordering case, an equi-probable sequence AAAA…TTTT…CCCC…GGGG…: p(A|A) is very high, p(T|A) is very low, and p(C|A) = p(G|A) = 0. Another extreme case, ATCGATCGATCG…: here p(T|A) = p(C|T) = p(G|C) = p(A|G) = 1, and all others are 0. So an equi-probable state ≠ independent events.

Divergence from independence – Conditional Entropy Consider the space of DNA dimers (nearest neighbors), S_2 = {AA, AT, …, TT}. The entropy of S_2 is H_2 = -[p(AA) log_2 p(AA) + p(AT) log_2 p(AT) + … + p(TT) log_2 p(TT)]. If the single-letter events are independent, p(X|Y) = p(X), and the dimer probabilities factorize: p(AA) = p(A)p(A), p(AT) = p(A)p(T), etc. If the dimers are not independent, p(XY) = p(X)p(Y|X), e.g. p(AA) = p(A)p(A|A), p(AT) = p(A)p(T|A), etc. Let H_2^Indep be the dimer entropy computed under complete independence. The divergence from independence is D_2 = H_2^Indep - H_2, and D_1 + D_2 is the total divergence from the maximum entropy state.
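A sketch of D_1 and D_2 computed directly from a sequence string (Python assumed; the toy sequence is invented, chosen so that neighboring bases are strongly dependent while the base composition is uniform):

```python
import numpy as np
from collections import Counter

def entropy_bits(probs):
    p = np.asarray(list(probs), dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def divergences(seq):
    """D1 (from equi-probability) and D2 (from independence) for a DNA string."""
    n = len(seq)
    base_p = {b: c / n for b, c in Counter(seq).items()}
    dimer_counts = Counter(seq[i:i + 2] for i in range(n - 1))
    dimer_p = {d: c / (n - 1) for d, c in dimer_counts.items()}

    h1 = entropy_bits(base_p.values())
    h2 = entropy_bits(dimer_p.values())
    # Dimer entropy if neighboring bases were independent: p(XY) = p(X) * p(Y)
    h2_indep = entropy_bits([base_p[x] * base_p[y] for x in base_p for y in base_p])

    d1 = np.log2(4) - h1
    d2 = h2_indep - h2
    return d1, d2

d1, d2 = divergences("ATCG" * 50)   # periodic sequence: D1 ~ 0, D2 large
print(f"D1 = {d1:.3f} bit, D2 = {d2:.3f} bits")
```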

Divergence from independence – Conditional Entropy Calculate D_1 and D_2 for M. phlei DNA, where p(A) = 0.164, p(T) = 0.162, p(C) = 0.337, p(G) = 0.337. Then H_1 = -(0.164 log_2 0.164 + 0.162 log_2 0.162 + 2 × 0.337 log_2 0.337) ≈ 1.91 bits, so D_1 = 2.00 - 1.91 ≈ 0.09 bit. D_2 = H_2^Indep - H_2 is computed from the dinucleotide frequencies (see the Excel file), and the total divergence is D_1 + D_2.

Divergence from independence – Conditional Entropy We can compare different sequences using H to establish relationships. Given the knowledge of one sequence, say X, can we estimate the uncertainty of Y relative to X? This is the relation between X, Y, and the conditional entropies H(X|Y) and H(Y|X): conditional entropy is the uncertainty relative to known information. H(X,Y) = H(Y|X) + H(X), i.e. the uncertainty of Y given knowledge of X plus the uncertainty of X sums to the joint entropy of X and Y; likewise H(X,Y) = H(X|Y) + H(Y). (To evaluate base-2 logarithms: if Y = 2^x, then log_10 Y = x log_10 2, so x = log_10 Y / log_10 2.) For the joint distribution on the next slide, H(Y|X) = H(X,Y) - H(X) = 1.85 - 0.97 = 0.88 bit.

Shannon Entropy – Mutual Information Joint entropy: H(X,Y) = -sum_ij p_ij log_2 p_ij, where p_ij is the joint probability of finding x_i and y_j. Example joint probabilities of (X,Y): p_00 = 0.1, p_01 = 0.3, p_10 = 0.4, p_11 = 0.2, giving H(X,Y) = 1.85 bits, H(X) = 0.97 bit and H(Y) = 1.00 bit. The mutual information M(X,Y) is the information shared by X and Y, and it can be used as a similarity measure between X and Y. H(X,Y) = H(X) + H(Y) - M(X,Y), like in set theory, where |A ∪ B| = |A| + |B| - |A ∩ B|. Therefore M(X,Y) = H(X) + H(Y) - H(X,Y) = H(X) - H(X|Y) = H(Y) - H(Y|X). For this example, M(X,Y) = 0.97 + 1.00 - 1.85 = 0.12 bit (equivalently, 1.00 - 0.88 = 0.12 bit).
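A minimal Python sketch that reproduces these numbers from the joint distribution on the slide (the language and helper names are assumptions; the probabilities are the slide's):

```python
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Joint distribution from the slide: rows = X in {0, 1}, columns = Y in {0, 1}
joint = np.array([[0.1, 0.3],
                  [0.4, 0.2]])

h_xy = entropy_bits(joint)                 # joint entropy H(X,Y)
h_x = entropy_bits(joint.sum(axis=1))      # marginal entropy H(X)
h_y = entropy_bits(joint.sum(axis=0))      # marginal entropy H(Y)

h_y_given_x = h_xy - h_x                   # conditional entropy H(Y|X)
mutual_info = h_x + h_y - h_xy             # mutual information M(X,Y)

print(f"H(X,Y) = {h_xy:.2f}, H(X) = {h_x:.2f}, H(Y) = {h_y:.2f}")
print(f"H(Y|X) = {h_y_given_x:.2f} bit, M(X,Y) = {mutual_info:.2f} bit")   # 0.88 and 0.12
```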

Shannon Entropy – Conditional Entropy H(Y|X) For the joint distribution above, the conditional distributions f(Y|X=x) are f(Y=0|X=0) = 1/4, f(Y=1|X=0) = 3/4, f(Y=0|X=1) = 4/6 and f(Y=1|X=1) = 2/6. The conditional entropy for a particular x is H(Y|X=x) = -sum_y f(y|x) log_2 f(y|x), and averaging over all x's gives H(Y|X) = sum_x p(x) H(Y|X=x) = 0.4 × H(1/4, 3/4) + 0.6 × H(4/6, 2/6) ≈ 0.88 bit.
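A short sketch of the same quantity computed the "averaged over all x's" way, as a check that it matches H(X,Y) - H(X) (Python assumed; joint distribution as on the earlier slide):

```python
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Same joint distribution as before: rows = X, columns = Y
joint = np.array([[0.1, 0.3],
                  [0.4, 0.2]])
p_x = joint.sum(axis=1)                    # marginal of X: [0.4, 0.6]
cond = joint / p_x[:, None]                # f(Y|X=x): [[1/4, 3/4], [4/6, 2/6]]

# H(Y|X) as the p(x)-weighted average of the per-x entropies H(Y|X=x)
h_y_given_x = sum(p_x[i] * entropy_bits(cond[i]) for i in range(len(p_x)))
print(f"H(Y|X) = {h_y_given_x:.2f} bit")   # ~0.88, matching H(X,Y) - H(X)
```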