1
Of Probability & Information Theory
Alexander G. Ororbia II, The Pennsylvania State University, IST 597: Foundations of Deep Learning
2
Probability Mass Function
Example: the uniform distribution over k discrete states, P(x = x_i) = 1/k for each state x_i.
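A minimal NumPy sketch (mine, not from the slides) of this example: the discrete uniform PMF assigns probability 1/k to each of k states, and the two defining properties of a PMF, non-negativity and summing to one, can be checked directly. The choice k = 6 is purely illustrative.

import numpy as np

k = 6                                  # e.g., a fair six-sided die (illustrative)
pmf = np.full(k, 1.0 / k)              # P(x = x_i) = 1/k for every state

assert np.all(pmf >= 0.0)              # every probability is non-negative
assert np.isclose(pmf.sum(), 1.0)      # the probabilities sum to one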
3
Probability Density Function
Example: the uniform distribution on an interval [a, b], u(x; a, b) = 1/(b - a) for x in [a, b] and 0 elsewhere.
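A companion sketch (again mine) for the continuous case: a density is non-negative and integrates to one, which a crude Riemann sum over a grid can confirm for the uniform density above. The endpoints a = 0 and b = 4 are arbitrary.

import numpy as np

a, b = 0.0, 4.0                                     # illustrative support of the uniform density
def uniform_pdf(x):
    # u(x; a, b) = 1/(b - a) on [a, b], 0 elsewhere
    return np.where((x >= a) & (x <= b), 1.0 / (b - a), 0.0)

xs = np.linspace(a - 1.0, b + 1.0, 600_001)
dx = xs[1] - xs[0]
print((uniform_pdf(xs) * dx).sum())                 # Riemann-sum approximation of the integral: ~1.0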
4
Computing Marginal Probability with the Sum Rule
Summation for discrete random variables; integration for continuous random variables.
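As a sketch of the discrete case (the joint table below is made up for illustration), marginalizing means summing the joint over the variable being removed:

import numpy as np

# Hypothetical joint distribution P(x, y): rows index x, columns index y.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.15, 0.20]])

p_x = joint.sum(axis=1)   # sum rule: P(x) = sum_y P(x, y)
p_y = joint.sum(axis=0)   # sum rule: P(y) = sum_x P(x, y)
assert np.isclose(p_x.sum(), 1.0) and np.isclose(p_y.sum(), 1.0)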
5
Conditional Probability
In probability theory, conditional probability is a measure of the probability of an event given that (by assumption, presumption, assertion, or evidence) another event has occurred.
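Concretely, P(x | y) = P(x, y) / P(y). A small sketch, reusing a made-up joint table: dividing each column of the joint by the corresponding marginal P(y) yields a valid conditional distribution over x for every value of y.

import numpy as np

joint = np.array([[0.10, 0.20, 0.10],               # hypothetical joint P(x, y)
                  [0.25, 0.15, 0.20]])
p_y = joint.sum(axis=0)                             # marginal P(y)

cond_x_given_y = joint / p_y                        # P(x | y) = P(x, y) / P(y), column by column
assert np.allclose(cond_x_given_y.sum(axis=0), 1.0) # each column sums to one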
6
Chain Rule of Probability
In probability theory, the chain rule (also called the general product rule) permits the calculation of any member of the joint distribution of a set of random variables using only conditional probabilities. The rule is useful in the study of Bayesian networks, which describe a probability distribution in terms of conditional probabilities.
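For concreteness, here is a sketch (my own, on a randomly generated three-variable joint) showing that the product of the chain-rule factors P(a) P(b | a) P(c | a, b) reconstructs the joint exactly:

import numpy as np

rng = np.random.default_rng(0)
joint = rng.random((2, 2, 2))
joint /= joint.sum()                                     # arbitrary joint P(a, b, c)

p_a = joint.sum(axis=(1, 2))                             # P(a)
p_b_given_a = joint.sum(axis=2) / p_a[:, None]           # P(b | a)
p_c_given_ab = joint / joint.sum(axis=2, keepdims=True)  # P(c | a, b)

rebuilt = p_a[:, None, None] * p_b_given_a[:, :, None] * p_c_given_ab
assert np.allclose(rebuilt, joint)                       # the chain rule recovers the joint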
7
Independence
In probability theory, two events are independent, statistically independent, or stochastically independent if the occurrence of one does not affect the probability of occurrence of the other. Similarly, two random variables are independent if the realization of one does not affect the probability distribution of the other.
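In symbols, x and y are independent iff P(x, y) = P(x) P(y) for every pair of values, i.e. the joint equals the outer product of its marginals. A sketch with two made-up joint tables, one that factors and one that does not:

import numpy as np

def are_independent(joint, tol=1e-12):
    # x and y are independent iff the joint equals the outer product of its marginals.
    p_x = joint.sum(axis=1)
    p_y = joint.sum(axis=0)
    return np.allclose(joint, np.outer(p_x, p_y), atol=tol)

independent_joint = np.outer([0.3, 0.7], [0.5, 0.5])     # constructed as a product, so independent
dependent_joint = np.array([[0.4, 0.1],
                            [0.1, 0.4]])                 # does not factor
print(are_independent(independent_joint), are_independent(dependent_joint))  # True False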
8
Conditional Independence
In probability theory, two events R and B are conditionally independent given a third event Y precisely if the occurrence of R and the occurrence of B are independent events in their conditional probability distribution given Y. In other words, R and B are conditionally independent given Y if and only if, given knowledge that Y occurs, knowledge of whether R occurs provides no information on the likelihood of B occurring, and knowledge of whether B occurs provides no information on the likelihood of R occurring.
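Equivalently, P(R, B | Y = y) = P(R | Y = y) P(B | Y = y) for every value y. A sketch that checks this factorization slice by slice, on a joint built to satisfy it (all tables below are made up):

import numpy as np

def conditionally_independent(joint_rby, tol=1e-12):
    # joint_rby[r, b, y] = P(R=r, B=b, Y=y); check the factorization within each slice Y = y.
    for y in range(joint_rby.shape[2]):
        rb = joint_rby[:, :, y] / joint_rby[:, :, y].sum()   # P(R, B | Y = y)
        if not np.allclose(rb, np.outer(rb.sum(axis=1), rb.sum(axis=0)), atol=tol):
            return False
    return True

p_y = np.array([0.4, 0.6])
p_r_given_y = np.array([[0.9, 0.2], [0.1, 0.8]])             # columns indexed by y
p_b_given_y = np.array([[0.7, 0.3], [0.3, 0.7]])
joint = np.einsum('y,ry,by->rby', p_y, p_r_given_y, p_b_given_y)
print(conditionally_independent(joint))                      # True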
9
Linearity of expectations: E_x[αf(x) + βg(x)] = α E_x[f(x)] + β E_x[g(x)].
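A quick sanity check via sampling (the constants and functions below are arbitrary); the sample mean is itself linear, so both sides agree up to floating-point error:

import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(1_000_000)
alpha, beta = 2.0, -3.0                             # arbitrary constants
f, g = np.square, np.abs                            # arbitrary functions of x

lhs = np.mean(alpha * f(x) + beta * g(x))           # E[alpha f(x) + beta g(x)]
rhs = alpha * np.mean(f(x)) + beta * np.mean(g(x))  # alpha E[f(x)] + beta E[g(x)]
assert np.isclose(lhs, rhs)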
10
Variance and Covariance
Covariance matrix: the matrix whose (i, j) entry is Cov(x_i, x_j). In probability theory and statistics, covariance is a measure of the joint variability of two random variables. If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the lesser values (i.e., the variables tend to show similar behavior), the covariance is positive. In the opposite case, when the greater values of one variable mainly correspond to the lesser values of the other (i.e., the variables tend to show opposite behavior), the covariance is negative. The sign of the covariance therefore shows the tendency in the linear relationship between the variables.
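As a sketch, NumPy's np.cov estimates the covariance matrix from samples; below, y is constructed to co-vary positively with x (the coefficients 0.8 and 0.6 are arbitrary), so the off-diagonal entries come out positive, near 0.8:

import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(100_000)
y = 0.8 * x + 0.6 * rng.standard_normal(100_000)    # y tends to move with x

samples = np.stack([x, y])                          # shape (2, N): one row per variable
print(np.cov(samples))                              # diagonal: variances; off-diagonal: Cov(x, y)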
11
Bernoulli Distribution
Can prove/derive each of these properties!
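Those properties (for a Bernoulli with parameter φ: E[x] = φ and Var(x) = φ(1 - φ)) can also be confirmed empirically; a sketch with an arbitrary φ = 0.3:

import numpy as np

phi = 0.3                                           # P(x = 1)
rng = np.random.default_rng(3)
x = rng.random(1_000_000) < phi                     # Bernoulli(phi) samples as booleans

print(x.mean(), phi)                                # empirical mean     ~ phi
print(x.var(), phi * (1 - phi))                     # empirical variance ~ phi * (1 - phi)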
12
Gaussian Distribution
Parametrized by variance: N(x; μ, σ^2) = sqrt(1 / (2πσ^2)) exp(-(x - μ)^2 / (2σ^2)). Parametrized by precision: N(x; μ, β^{-1}) = sqrt(β / (2π)) exp(-(β/2)(x - μ)^2).
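The two parametrizations describe the same density whenever β = 1/σ^2; a sketch evaluating both forms on a grid (μ = 0 and σ^2 = 2 are arbitrary):

import numpy as np

def gaussian_pdf_var(x, mu, sigma2):
    # N(x; mu, sigma^2), parametrized by the variance sigma^2.
    return np.sqrt(1.0 / (2.0 * np.pi * sigma2)) * np.exp(-((x - mu) ** 2) / (2.0 * sigma2))

def gaussian_pdf_prec(x, mu, beta):
    # N(x; mu, beta^{-1}), parametrized by the precision beta = 1 / sigma^2.
    return np.sqrt(beta / (2.0 * np.pi)) * np.exp(-0.5 * beta * (x - mu) ** 2)

xs = np.linspace(-5.0, 5.0, 11)
assert np.allclose(gaussian_pdf_var(xs, 0.0, 2.0), gaussian_pdf_prec(xs, 0.0, 0.5))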
13
Gaussian Distribution
Figure 3.1
14
Multivariate Gaussian
Parametrized by covariance matrix: N(x; μ, Σ) = sqrt(1 / ((2π)^n det(Σ))) exp(-(1/2)(x - μ)^T Σ^{-1} (x - μ)). Parametrized by precision matrix: N(x; μ, β^{-1}) = sqrt(det(β) / (2π)^n) exp(-(1/2)(x - μ)^T β (x - μ)).
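A direct sketch of the covariance parametrization (μ and Σ below are illustrative), using a linear solve rather than an explicit inverse for the quadratic form:

import numpy as np

def mvn_pdf(x, mu, sigma):
    # N(x; mu, Sigma) = (2*pi)^(-n/2) det(Sigma)^(-1/2) exp(-0.5 (x - mu)^T Sigma^{-1} (x - mu))
    n = x.shape[0]
    diff = x - mu
    quad = diff @ np.linalg.solve(sigma, diff)      # (x - mu)^T Sigma^{-1} (x - mu)
    norm = (2.0 * np.pi) ** (-n / 2) * np.linalg.det(sigma) ** (-0.5)
    return norm * np.exp(-0.5 * quad)

mu = np.zeros(2)
sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])                      # illustrative covariance matrix
print(mvn_pdf(np.array([0.2, -0.3]), mu, sigma))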
15
More Distributions
Exponential: p(x; λ) = λ exp(-λx) for x ≥ 0, and 0 otherwise.
Laplace: Laplace(x; μ, γ) = (1 / (2γ)) exp(-|x - μ| / γ).
Dirac: p(x) = δ(x - μ), the density of an idealized point mass or point charge, a function that is equal to zero everywhere except at a single point and whose integral over the entire real line is equal to one.
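A sketch of the exponential and Laplace densities with arbitrary parameters, checking by Riemann sum that each integrates to approximately one; the Dirac delta is not an ordinary function, so it is handled symbolically (or as a limit of ever-narrower densities) rather than evaluated in code:

import numpy as np

def exponential_pdf(x, lam):
    # p(x; lambda) = lambda * exp(-lambda * x) for x >= 0, else 0.
    return np.where(x >= 0, lam * np.exp(-lam * x), 0.0)

def laplace_pdf(x, mu, gamma):
    # Laplace(x; mu, gamma) = 1 / (2 * gamma) * exp(-|x - mu| / gamma)
    return np.exp(-np.abs(x - mu) / gamma) / (2.0 * gamma)

xs = np.linspace(-10.0, 10.0, 200_001)
dx = xs[1] - xs[0]
print((exponential_pdf(xs, 1.5) * dx).sum())        # ~1.0
print((laplace_pdf(xs, 0.0, 1.0) * dx).sum())       # ~1.0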
16
Laplace Distribution
17
Empirical Distribution
An empirical distribution function is the distribution function associated with the empirical measure of a sample.
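In other words, the empirical distribution places probability mass 1/m on each of the m observed samples. A sketch of the resulting empirical CDF (the data array is made up):

import numpy as np

def empirical_cdf(samples, x):
    # Fraction of observed samples <= x; each sample carries mass 1/m.
    samples = np.asarray(samples)
    return np.mean(samples[None, :] <= np.asarray(x)[:, None], axis=1)

data = np.array([2.0, 3.5, 3.5, 7.0, 9.0])               # hypothetical sample
print(empirical_cdf(data, np.array([1.0, 3.5, 10.0])))   # [0.0, 0.6, 1.0]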
18
Mixture Distributions
In probability and statistics, a mixture distribution is the probability distribution of a random variable that is derived from a collection of other random variables as follows: first, a random variable is selected by chance from the collection according to given probabilities of selection, and then the value of the selected random variable is realized.
Figure 3.2: Gaussian mixture with three components
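That two-step recipe is exactly ancestral sampling: draw a component index from the mixing probabilities, then sample from the chosen component. A sketch of a three-component Gaussian mixture (the weights, means, and standard deviations are illustrative, not those of Figure 3.2):

import numpy as np

rng = np.random.default_rng(4)
weights = np.array([0.3, 0.4, 0.3])                 # P(c = i): probabilities of selection
means = np.array([-4.0, 0.0, 5.0])
stds = np.array([1.0, 0.5, 2.0])

def sample_mixture(n):
    c = rng.choice(len(weights), size=n, p=weights) # step 1: select a component per draw
    return rng.normal(means[c], stds[c])            # step 2: realize a value from that component

print(sample_mixture(5))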
19
Logistic Sigmoid
σ(x) = 1 / (1 + exp(-x)); commonly used to parametrize Bernoulli distributions.
20
Softplus Function
ζ(x) = log(1 + exp(x)), a smoothed version of the rectifier max(0, x).
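A joint sketch of both functions: the sigmoid saturates to 0 and 1 for large |x| (hence its use for the Bernoulli parameter φ), and the softplus is computed here with np.logaddexp to avoid overflow for large inputs.

import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + exp(-x)), written to stay numerically stable for large |x|.
    return np.where(x >= 0, 1.0 / (1.0 + np.exp(-x)), np.exp(x) / (1.0 + np.exp(x)))

def softplus(x):
    # zeta(x) = log(1 + exp(x)) = logaddexp(0, x)
    return np.logaddexp(0.0, x)

xs = np.array([-50.0, -1.0, 0.0, 1.0, 50.0])
print(sigmoid(xs))        # squashes the real line into (0, 1)
print(softplus(xs))       # smooth approximation to max(0, x); its derivative is sigmoid(x)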
21
Bayes’ Rule
P(A | B) = P(B | A) P(A) / P(B). Fundamental to statistical learning! Memorize this rule!
22
In English, please? What does Bayes’ formula help us find?
It helps us find the posterior P(A | B), by having already known the likelihood P(B | A), the prior P(A), and the evidence P(B).
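A classic worked example (my own numbers, not from the slides): a diagnostic test for a rare condition. Even with a sensitive test, the posterior stays modest because the prior is small.

# Hypothetical numbers, chosen only to illustrate P(A | B) = P(B | A) * P(A) / P(B).
p_disease = 0.01                      # prior P(A): prevalence of the condition
p_pos_given_disease = 0.99            # likelihood P(B | A): test sensitivity
p_pos_given_healthy = 0.05            # false-positive rate, i.e. 1 - specificity

# Evidence P(B) by the sum rule: P(B) = sum over A of P(B | A) * P(A).
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1.0 - p_disease)

p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_disease_given_pos)            # ~0.167: a single positive test is far from conclusive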
23
Why do we care? Deep generative models!
25
Sparse Coding
26
(Probabilistic) Sparse Coding
28
Preview: Information Theory
Entropy: H(x) = -E_{x~P}[log P(x)]. KL divergence: D_KL(P || Q) = E_{x~P}[log P(x) - log Q(x)].
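A sketch of both quantities for discrete distributions (the two distributions below are arbitrary); it also prints the divergence in both directions, previewing the asymmetry on the next slide:

import numpy as np

def entropy(p):
    # H(P) = -sum_x P(x) log P(x), in nats; 0 log 0 is treated as 0.
    p = np.asarray(p)
    return -np.sum(np.where(p > 0, p * np.log(p), 0.0))

def kl_divergence(p, q):
    # D_KL(P || Q) = sum_x P(x) (log P(x) - log Q(x)); assumes q > 0 wherever p > 0.
    p, q = np.asarray(p), np.asarray(q)
    return np.sum(np.where(p > 0, p * (np.log(p) - np.log(q)), 0.0))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.3, 0.3, 0.4])
print(entropy(p))                                 # uncertainty in P, in nats
print(kl_divergence(p, q), kl_divergence(q, p))   # not equal: the KL divergence is asymmetric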
29
The KL Divergence is Asymmetric
Minimizing D_KL(p || q) with respect to q is mean-seeking: q spreads out to cover all of p's modes. Minimizing D_KL(q || p) is mode-seeking: q concentrates on a single mode. Figure 3.6
30
Directed Model
Figure 3.7
31
Undirected Model
Figure 3.8
32
References
This is a variation of Ian Goodfellow’s slides for Chapter 3 of Deep Learning (Goodfellow, Bengio, and Courville; www.deeplearningbook.org).