Of Probability & Information Theory Alexander G. Ororbia II The Pennsylvania State University IST 597: Foundations of Deep Learning
Probability Mass Function Example: uniform distribution:
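For reference, a sketch of the uniform PMF over k discrete states, in the textbook's notation (k assumed): $P(\mathrm{x} = x_i) = \frac{1}{k}$ for each state $x_i$, so that $\sum_i P(\mathrm{x} = x_i) = \sum_i \frac{1}{k} = \frac{k}{k} = 1$.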
Probability Density Function Example: uniform distribution:
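A sketch of the uniform density on an interval $[a, b]$ (with $a$ and $b$ assumed to be the interval endpoints): $u(x; a, b) = \frac{1}{b - a}$ for $x \in [a, b]$ and $u(x; a, b) = 0$ otherwise, so the density integrates to 1.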
Computing Marginal Probability with the Sum Rule Summation for discrete random variables! Integration for continuous random variables!
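Both forms of the sum rule, in standard notation: discrete case $P(\mathrm{x} = x) = \sum_y P(\mathrm{x} = x, \mathrm{y} = y)$; continuous case $p(x) = \int p(x, y)\, dy$.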
Conditional Probability In probability theory, conditional probability is a measure of the probability of an event given that (by assumption, presumption, assertion or evidence) another event has occurred
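The defining formula, assuming $P(\mathrm{x} = x) > 0$: $P(\mathrm{y} = y \mid \mathrm{x} = x) = \frac{P(\mathrm{y} = y, \mathrm{x} = x)}{P(\mathrm{x} = x)}$.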
Chain Rule of Probability In probability theory, the chain rule (also called the general product rule) permits the calculation of any member of the joint distribution of a set of random variables using only conditional probabilities. The rule is useful in the study of Bayesian networks, which describe a probability distribution in terms of conditional probabilities.
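The general form: $P(\mathrm{x}^{(1)}, \ldots, \mathrm{x}^{(n)}) = P(\mathrm{x}^{(1)}) \prod_{i=2}^{n} P(\mathrm{x}^{(i)} \mid \mathrm{x}^{(1)}, \ldots, \mathrm{x}^{(i-1)})$. For example, $P(a, b, c) = P(a)\, P(b \mid a)\, P(c \mid a, b)$.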
Independence In probability theory, two events are independent, statistically independent, or stochastically independent if the occurrence of one does not affect the probability of occurrence of the other. Similarly, two random variables are independent if the realization of one does not affect the probability distribution of the other.
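In symbols, x and y are independent ($\mathrm{x} \perp \mathrm{y}$) if and only if $\forall x, y:\; p(\mathrm{x} = x, \mathrm{y} = y) = p(\mathrm{x} = x)\, p(\mathrm{y} = y)$.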
Conditional Independence In probability theory, two events R and B are conditionally independent given a third event Y precisely if the occurrence of R and the occurrence of B are independent events in their conditional probability distribution given Y. In other words, R and B are conditionally independent given Y if and only if, given knowledge that Y occurs, knowledge of whether R occurs provides no information on the likelihood of B occurring, and knowledge of whether B occurs provides no information on the likelihood of R occurring.
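In symbols, using the slide's events: R and B are conditionally independent given Y if and only if $P(R, B \mid Y) = P(R \mid Y)\, P(B \mid Y)$, or equivalently $P(R \mid B, Y) = P(R \mid Y)$.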
Linearity of expectations:
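The rule itself, for constants $\alpha, \beta$ and functions $f, g$: $\mathbb{E}_{\mathrm{x}}[\alpha f(x) + \beta g(x)] = \alpha\, \mathbb{E}_{\mathrm{x}}[f(x)] + \beta\, \mathbb{E}_{\mathrm{x}}[g(x)]$.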
Variance and Covariance Covariance matrix: In probability theory and statistics, covariance is a measure of the joint variability of two random variables. If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the lesser values, i.e., the variables tend to show similar behavior, the covariance is positive. In the opposite case, when the greater values of one variable mainly correspond to the lesser values of the other, i.e., the variables tend to show opposite behavior, the covariance is negative. The sign of the covariance therefore shows the tendency in the linear relationship between the variables.
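The defining formulas, in the textbook's notation: $\mathrm{Var}(f(x)) = \mathbb{E}\left[(f(x) - \mathbb{E}[f(x)])^2\right]$ and $\mathrm{Cov}(f(x), g(y)) = \mathbb{E}\left[(f(x) - \mathbb{E}[f(x)])(g(y) - \mathbb{E}[g(y)])\right]$. For a random vector $\mathbf{x} \in \mathbb{R}^n$, the covariance matrix has entries $\mathrm{Cov}(\mathbf{x})_{i,j} = \mathrm{Cov}(x_i, x_j)$, with the variances $\mathrm{Var}(x_i)$ on the diagonal.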
Bernoulli Distribution Can prove/derive each of these properties!
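The properties in question, for parameter $\phi \in [0, 1]$: $P(\mathrm{x} = 1) = \phi$, $P(\mathrm{x} = 0) = 1 - \phi$, compactly $P(\mathrm{x} = x) = \phi^x (1 - \phi)^{1 - x}$, with $\mathbb{E}[\mathrm{x}] = \phi$ and $\mathrm{Var}(\mathrm{x}) = \phi(1 - \phi)$.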
Gaussian Distribution Parametrized by variance: Parametrized by precision:
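Both parameterizations, with mean $\mu$, variance $\sigma^2$, and precision $\beta = 1/\sigma^2$: $\mathcal{N}(x; \mu, \sigma^2) = \sqrt{\frac{1}{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$ and $\mathcal{N}(x; \mu, \beta^{-1}) = \sqrt{\frac{\beta}{2\pi}} \exp\left(-\frac{1}{2}\beta (x - \mu)^2\right)$.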
Gaussian Distribution Figure 3.1
Multivariate Gaussian Parametrized by covariance matrix: Parametrized by precision matrix:
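Both forms, with mean vector $\boldsymbol{\mu}$, positive definite covariance matrix $\boldsymbol{\Sigma}$, and precision matrix $\boldsymbol{\beta} = \boldsymbol{\Sigma}^{-1}$: $\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sqrt{\frac{1}{(2\pi)^n \det(\boldsymbol{\Sigma})}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$ and $\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\beta}^{-1}) = \sqrt{\frac{\det(\boldsymbol{\beta})}{(2\pi)^n}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\beta} (\mathbf{x} - \boldsymbol{\mu})\right)$.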
More Distributions Exponential: Laplace: Dirac: The density of an idealized point mass or point charge, as a function that is equal to zero everywhere except at zero and whose integral over the entire real line is equal to one.
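Standard forms of the three densities, with $\lambda$, $\mu$, $\gamma$ as the usual rate, location, and scale parameters: Exponential $p(x; \lambda) = \lambda\, \mathbf{1}_{x \geq 0} \exp(-\lambda x)$; Laplace $\mathrm{Laplace}(x; \mu, \gamma) = \frac{1}{2\gamma} \exp\left(-\frac{|x - \mu|}{\gamma}\right)$; Dirac $p(x) = \delta(x - \mu)$ with $\int \delta(x - \mu)\, dx = 1$.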
Laplace Distribution
Empirical Distribution An empirical distribution function is the distribution function associated with the empirical measure of a sample.
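A common way to write it for a sample $x^{(1)}, \ldots, x^{(m)}$ (the continuous case uses Dirac deltas): $\hat{p}(x) = \frac{1}{m} \sum_{i=1}^{m} \delta\left(x - x^{(i)}\right)$, which puts probability mass $\frac{1}{m}$ on each observed point.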
Mixture Distributions In probability and statistics, a mixture distribution is the probability distribution of a random variable that is derived from a collection of other random variables as follows: first, a random variable is selected by chance from the collection according to given probabilities of selection, and then the value of the selected random variable is realized. (Figure 3.2: Gaussian mixture with three components)
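The general form, with a latent component identity c selecting which component generates the sample: $P(\mathrm{x}) = \sum_i P(\mathrm{c} = i)\, P(\mathrm{x} \mid \mathrm{c} = i)$.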
Logistic Sigmoid Commonly used to parametrize Bernoulli distributions
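The sigmoid and two identities worth remembering (it saturates for large $|x|$): $\sigma(x) = \frac{1}{1 + \exp(-x)} \in (0, 1)$, with $1 - \sigma(x) = \sigma(-x)$ and $\frac{d}{dx}\sigma(x) = \sigma(x)(1 - \sigma(x))$.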
Softplus Function
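Its definition; it is a smoothed version of the rectifier $x^{+} = \max(0, x)$, and its derivative is the logistic sigmoid: $\zeta(x) = \log(1 + \exp(x))$, $\frac{d}{dx}\zeta(x) = \sigma(x)$.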
Bayes’ Rule Fundamental to statistical learning!! Memorize this rule!
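The rule: $P(\mathrm{x} \mid \mathrm{y}) = \frac{P(\mathrm{x})\, P(\mathrm{y} \mid \mathrm{x})}{P(\mathrm{y})}$, where the denominator can be computed as $P(\mathrm{y}) = \sum_x P(\mathrm{y} \mid x)\, P(x)$.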
In English Please? What does Bayes’ Formula help us find? It lets us reverse a conditional: we obtain the probability of a hypothesis given the observed evidence, using what we already know — the probability of the evidence given the hypothesis, and the prior probability of the hypothesis.
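Labeled in words: $\text{posterior } P(\mathrm{x} \mid \mathrm{y}) = \frac{\text{likelihood } P(\mathrm{y} \mid \mathrm{x}) \times \text{prior } P(\mathrm{x})}{\text{evidence } P(\mathrm{y})}$.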
Why do we care? Deep generative models!
Sparse Coding
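As a rough sketch (the exact formulation on the original slide may differ), sparse coding is often posed as inference of a code $\mathbf{h}$ for a dictionary $\mathbf{W}$ with an L1 penalty encouraging sparsity: $\mathbf{h}^{*} = \arg\min_{\mathbf{h}} \; \lambda \|\mathbf{h}\|_1 + \|\mathbf{x} - \mathbf{W}\mathbf{h}\|_2^2$.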
(Probabilistic) Sparse Coding
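A sketch of one common probabilistic formulation, assuming a factorial Laplace prior on the code and isotropic Gaussian observation noise (details may differ from the original slide): $p(h_i) = \mathrm{Laplace}\left(h_i; 0, \tfrac{2}{\lambda}\right)$ and $p(\mathbf{x} \mid \mathbf{h}) = \mathcal{N}\left(\mathbf{x}; \mathbf{W}\mathbf{h} + \mathbf{b}, \tfrac{1}{\beta}\mathbf{I}\right)$; MAP inference over $\mathbf{h}$ then recovers an L1-penalized reconstruction objective of the form above.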
Preview: Information Theory Entropy: KL divergence:
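The two definitions, measured in nats when the natural log is used: $H(\mathrm{x}) = -\mathbb{E}_{\mathrm{x} \sim P}[\log P(x)]$ and $D_{\mathrm{KL}}(P \,\|\, Q) = \mathbb{E}_{\mathrm{x} \sim P}\left[\log \frac{P(x)}{Q(x)}\right] = \mathbb{E}_{\mathrm{x} \sim P}[\log P(x) - \log Q(x)]$.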
The KL Divergence is Asymmetric (Figure 3.6: mean-seeking vs. mode-seeking fits)
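Concretely, when fitting an approximation $q$ to a multimodal distribution $p$: minimizing $D_{\mathrm{KL}}(p \,\|\, q)$ forces $q$ to place mass everywhere $p$ does, giving a broad, mean-seeking fit that blurs across modes; minimizing $D_{\mathrm{KL}}(q \,\|\, p)$ instead penalizes $q$ for placing mass where $p$ is small, so $q$ locks onto a single high-probability mode (mode-seeking).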
Directed Model Figure 3.7
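Directed (Bayesian network) models factor the joint distribution into one conditional per variable given its parents in the graph $\mathcal{G}$: $p(\mathbf{x}) = \prod_i p\left(x_i \mid Pa_{\mathcal{G}}(x_i)\right)$. For a graph like the book's Figure 3.7 example, this works out to $p(a, b, c, d, e) = p(a)\, p(b \mid a)\, p(c \mid a, b)\, p(d \mid b)\, p(e \mid c)$.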
Undirected Model Figure 3.8
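Undirected (Markov random field) models factor into nonnegative clique potentials with a normalizing constant $Z$: $p(\mathbf{x}) = \frac{1}{Z} \prod_i \phi^{(i)}\left(\mathcal{C}^{(i)}\right)$. For a graph like the book's Figure 3.8 example, this takes the form $p(a, b, c, d, e) = \frac{1}{Z}\, \phi^{(1)}(a, b, c)\, \phi^{(2)}(b, d)\, \phi^{(3)}(c, e)$.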
References This presentation is a variation of Ian Goodfellow’s slides for Chapter 3 of Deep Learning (http://www.deeplearningbook.org/lecture_slides.html)