Probability and Information Theory
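Random Variables A random variable is a variable that can take on different values randomly. On its own it is only a description of the states that are possible; it must be paired with a probability distribution that specifies how likely each state is. It is denoted by a lower-case letter and may be discrete or continuous. Ex) P(x = ‘yes’)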
Probability Distributions A probability distribution is a description of how likely a random variable or set of random variables is to take on each of its possible states.
Discrete Variables and Probability Mass Functions A probability distribution over discrete variables may be described using a probability mass function (PMF). A PMF maps from a state of a random variable to the probability of that random variable taking on that state. P(x = x): maps to the probability that the random variable x takes the state (value) x.
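A PMF over a small state space can be sketched as a lookup table. The states and probabilities below are made up for illustration, and the helper name is_valid_pmf is hypothetical; exact fractions are used so the normalization check is not affected by rounding.

```python
from fractions import Fraction

# Hypothetical PMF over a discrete random variable x with three states.
pmf = {"yes": Fraction(1, 2), "no": Fraction(3, 10), "maybe": Fraction(1, 5)}

def is_valid_pmf(p):
    """Check that every probability lies in [0, 1] and that they sum to 1."""
    return all(0 <= v <= 1 for v in p.values()) and sum(p.values()) == 1

p_yes = pmf["yes"]           # P(x = 'yes')
valid = is_valid_pmf(pmf)    # normalization and range check
```

Looking up a key plays the role of evaluating P(x = x) for one state.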
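Discrete Variables and Probability Mass Functions Joint probability distribution: a probability mass function over several variables at the same time. P(x = x, y = y) denotes the probability that x = x and y = y simultaneously; this may be abbreviated as P(x, y).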
Continuous Variables and Probability Density Functions When working with continuous random variables, we describe probability distributions using a probability density function (PDF) rather than a probability mass function. To be a probability density function, a function p must satisfy the following properties:
• The domain of p must be the set of all possible states of x.
• ∀x ∈ x, p(x) ≥ 0. Note that we do not require p(x) ≤ 1.
• ∫ p(x) dx = 1.
Continuous Variables and Probability Density Functions A probability density function p(x) does not give the probability of a specific state directly; instead, the probability of landing inside an infinitesimal region with volume δx is given by p(x)δx. We can integrate the density function to find the actual probability mass of a set of points. Specifically, the probability that x lies in some set S is given by the integral of p(x) over that set. In the univariate example, the probability that x lies in the interval [a, b] is given by ∫_[a,b] p(x) dx.
Continuous Variables and Probability Density Functions For an example of a probability density function corresponding to a specific probability density over a continuous random variable, consider a uniform distribution on an interval of the real numbers. We can do this with a function u(x; a, b), where a and b are the endpoints of the interval, with b > a. The “;” notation means “parametrized by”; we consider x to be the argument of the function, while a and b are parameters that define the function. To ensure that there is no probability mass outside the interval, we say u(x; a, b) = 0 for all x ∉ [a, b]. Within [a, b], u(x; a, b) = 1/(b − a). We can see that this is nonnegative everywhere. Additionally, it integrates to 1. We often denote that x follows the uniform distribution on [a, b] by writing x ∼ U(a, b).
Marginal Probability Sometimes we know the probability distribution over a set of variables and we want to know the probability distribution over just a subset of them. The probability distribution over the subset is known as the marginal probability distribution.
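For discrete variables, a marginal can be obtained by summing the joint table over the variables that are not of interest. The sketch below assumes a made-up joint PMF over two variables; the state names and the helper marginal are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical joint PMF P(x, y), keyed by (x, y).
joint = {
    ("sun", "warm"): Fraction(4, 10),
    ("sun", "cold"): Fraction(1, 10),
    ("rain", "warm"): Fraction(2, 10),
    ("rain", "cold"): Fraction(3, 10),
}

def marginal(joint, axis):
    """Sum the joint over every variable except the one at position `axis`."""
    out = defaultdict(Fraction)  # Fraction() is 0
    for states, prob in joint.items():
        out[states[axis]] += prob
    return dict(out)

p_x = marginal(joint, 0)  # distribution over x alone
p_y = marginal(joint, 1)  # distribution over y alone
```

For continuous variables the sum would be replaced by an integral over the variables being removed.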
Conditional Probability The probability of some event, given that some other event has happened: P(y = y | x = x) = P(y = y, x = x) / P(x = x). It is defined only when P(x = x) > 0.
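Conditioning on an observed value can be sketched as taking the matching slice of a joint table and renormalizing it. The joint PMF and the function name below are assumptions made for illustration.

```python
from fractions import Fraction

# Hypothetical joint PMF P(x, y), keyed by (x, y).
joint = {
    ("sun", "warm"): Fraction(4, 10),
    ("sun", "cold"): Fraction(1, 10),
    ("rain", "warm"): Fraction(2, 10),
    ("rain", "cold"): Fraction(3, 10),
}

def conditional_y_given_x(joint, x):
    """Return P(y | x = x) as joint(x, y) / P(x = x) for every y."""
    p_x = sum(p for (xs, _), p in joint.items() if xs == x)
    if p_x == 0:
        raise ValueError("conditional probability is undefined when P(x) = 0")
    return {ys: p / p_x for (xs, ys), p in joint.items() if xs == x}

p_y_given_rain = conditional_y_given_x(joint, "rain")
```

The explicit error for an impossible conditioning event mirrors the fact that the ratio has no meaning when its denominator is zero.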
The Chain Rule of Conditional Probabilities Any joint probability distribution over many random variables may be decomposed into conditional distributions over only one variable. Chain rule: P(x(1), …, x(n)) = P(x(1)) ∏_{i=2..n} P(x(i) | x(1), …, x(i−1)).
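The decomposition can be checked numerically on a small example. The sketch below assumes a made-up joint PMF over three binary variables named a, b and c (all names are hypothetical) and verifies that P(a) P(b | a) P(c | a, b) reproduces every entry of the joint.

```python
from fractions import Fraction
from itertools import product

# Hypothetical joint PMF over three binary variables (a, b, c).
# The weights 1..8 sum to 36, so dividing by 36 normalizes the table.
weights = [1, 2, 3, 4, 5, 6, 7, 8]
joint = {s: Fraction(w, 36) for s, w in zip(product([0, 1], repeat=3), weights)}

NAMES = ("a", "b", "c")

def prob(**fixed):
    """Probability that the named variables take the given values (others summed out)."""
    return sum(p for s, p in joint.items()
               if all(s[NAMES.index(k)] == v for k, v in fixed.items()))

def chain_rule(a, b, c):
    """P(a) * P(b | a) * P(c | a, b), each factor computed from the joint."""
    p_a = prob(a=a)
    p_b_given_a = prob(a=a, b=b) / p_a
    p_c_given_ab = prob(a=a, b=b, c=c) / prob(a=a, b=b)
    return p_a * p_b_given_a * p_c_given_ab

holds = all(chain_rule(*s) == joint[s] for s in joint)
```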
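Independence and Conditional Independence Two random variables x and y are independent if their joint distribution can be expressed as a product of two factors, one involving only x and one involving only y: p(x = x, y = y) = p(x = x) p(y = y) for all x and y. They are conditionally independent given a random variable z if p(x = x, y = y | z = z) = p(x = x | z = z) p(y = y | z = z) for every value of z.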
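One way to test independence on a finite table is to compare each joint entry with the product of the two marginals. Both tables below are made up for illustration, and the helper names are hypothetical.

```python
from fractions import Fraction

def marginals(joint):
    """Return (P(x), P(y)) from a joint PMF keyed by (x, y)."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0) + p
        p_y[y] = p_y.get(y, 0) + p
    return p_x, p_y

def is_independent(joint):
    """True when P(x, y) == P(x) * P(y) for every pair of states."""
    p_x, p_y = marginals(joint)
    return all(joint.get((x, y), 0) == p_x[x] * p_y[y] for x in p_x for y in p_y)

# Built as a product of marginals, so independent by construction.
independent = {(x, y): px * py
               for x, px in {0: Fraction(1, 4), 1: Fraction(3, 4)}.items()
               for y, py in {0: Fraction(1, 3), 1: Fraction(2, 3)}.items()}

# x always equals y, so knowing one determines the other.
dependent = {(0, 0): Fraction(1, 2), (1, 1): Fraction(1, 2)}
```

Exact fractions make the equality test meaningful; with floating point values a tolerance would be needed instead.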
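Expectation The expectation or expected value of some function f(x) with respect to a probability distribution P(x) is the average or mean value that f takes on when x is drawn from P. For discrete variables, E[f(x)] = Σ_x P(x) f(x); for continuous variables, E[f(x)] = ∫ p(x) f(x) dx.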
Expectation Expectations are linear: E[α f(x) + β g(x)] = α E[f(x)] + β E[g(x)], when α and β do not depend on x.
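Linearity can be confirmed on a small discrete example. The PMF, the functions f and g, and the constants alpha and beta below are all assumptions chosen for illustration.

```python
from fractions import Fraction

# Hypothetical PMF over the states 1, 2, 3.
pmf = {1: Fraction(1, 6), 2: Fraction(1, 3), 3: Fraction(1, 2)}

def expectation(f, pmf):
    """Discrete expectation: sum over states of P(x) * f(x)."""
    return sum(p * f(x) for x, p in pmf.items())

def f(x):
    return x

def g(x):
    return x * x

alpha, beta = 2, -3  # constants that do not depend on x

lhs = expectation(lambda x: alpha * f(x) + beta * g(x), pmf)
rhs = alpha * expectation(f, pmf) + beta * expectation(g, pmf)
```

Both sides come out identical, which is the linearity property evaluated on this particular table.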
Variance A measure of how much the values of a function of a random variable x vary as we sample different values of x from its probability distribution: Var(f(x)) = E[(f(x) − E[f(x)])²]. The square root of the variance is the standard deviation.
Covariance Gives some sense of how much two values are linearly related to each other, as well as the scale of these variables: Cov(f(x), g(y)) = E[(f(x) − E[f(x)])(g(y) − E[g(y)])].
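Variance and covariance are both expectations of centered quantities, so they can be computed with the same helper. The joint PMF below is made up, and taking f and g to be the identity is an assumption made to keep the sketch short.

```python
from fractions import Fraction

# Hypothetical joint PMF over two binary variables that tend to agree.
joint = {(0, 0): Fraction(3, 8), (0, 1): Fraction(1, 8),
         (1, 0): Fraction(1, 8), (1, 1): Fraction(3, 8)}

def expect(f):
    """Expectation of f(x, y) under the joint PMF."""
    return sum(p * f(x, y) for (x, y), p in joint.items())

mean_x = expect(lambda x, y: x)
mean_y = expect(lambda x, y: y)

# Var(x) = E[(x - E[x])^2]
var_x = expect(lambda x, y: (x - mean_x) ** 2)

# Cov(x, y) = E[(x - E[x]) * (y - E[y])]
cov_xy = expect(lambda x, y: (x - mean_x) * (y - mean_y))
```

The positive covariance reflects that the two variables take the same value more often than not in this table.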
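Bernoulli Distribution A distribution over a single binary random variable, controlled by a single parameter φ ∈ [0, 1] that gives the probability of the variable being equal to 1: P(x = x) = φ^x (1 − φ)^(1−x).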
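A minimal sketch of a Bernoulli PMF, writing phi for the probability that the variable equals 1 (the parameter value 3/10 is made up). The mean and variance are computed directly from the PMF rather than from closed-form expressions.

```python
from fractions import Fraction

def bernoulli_pmf(x, phi):
    """P(x) = phi^x * (1 - phi)^(1 - x) for x in {0, 1}."""
    if x not in (0, 1):
        raise ValueError("a Bernoulli variable takes only the values 0 and 1")
    return phi ** x * (1 - phi) ** (1 - x)

phi = Fraction(3, 10)  # hypothetical parameter value

mean = sum(x * bernoulli_pmf(x, phi) for x in (0, 1))
variance = sum((x - mean) ** 2 * bernoulli_pmf(x, phi) for x in (0, 1))
```

The results agree with the well-known closed forms phi and phi * (1 - phi).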
Multinoulli Distribution The multinoulli, or categorical, distribution is a distribution over a single discrete variable with k different states, where k is finite. It is parametrized by a vector p ∈ [0, 1]^(k−1), where p_i gives the probability of the i-th state. The final, k-th state’s probability is given by 1 − 1^T p, so we must constrain 1^T p ≤ 1.
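The parametrization with k − 1 numbers can be sketched as follows: the last probability is recovered as whatever mass remains. The parameter values and the helper name full_probabilities are assumptions; sampling uses random.choices from the Python standard library.

```python
import random
from fractions import Fraction

def full_probabilities(p):
    """Append the k-th probability, 1 - sum(p), to the k - 1 given ones."""
    last = 1 - sum(p)
    if any(pi < 0 for pi in p) or last < 0:
        raise ValueError("need every p_i >= 0 and sum(p) <= 1")
    return list(p) + [last]

p = [Fraction(1, 2), Fraction(3, 10)]  # hypothetical parameters, so k = 3
probs = full_probabilities(p)

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = rng.choices(range(len(probs)),
                      weights=[float(q) for q in probs], k=10_000)
freq_state_0 = samples.count(0) / len(samples)
```

With enough samples, the empirical frequency of each state should land close to its probability.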
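Gaussian Distribution The most commonly used distribution over real numbers is the normal distribution, also known as the Gaussian distribution: N(x; µ, σ²) = sqrt(1 / (2πσ²)) exp(−(x − µ)² / (2σ²)). The parameter µ is the mean and σ is the standard deviation.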
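A sketch of the univariate normal density, with a simple midpoint-rule integrator (a hypothetical helper, not a library routine) used to check that the density integrates to approximately 1 and that roughly 68% of the mass lies within one standard deviation of the mean.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal density with mean mu and standard deviation sigma."""
    var = sigma ** 2
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def integrate(f, lo, hi, n=20_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

total = integrate(normal_pdf, -10.0, 10.0)          # close to 1
within_one_sigma = integrate(normal_pdf, -1.0, 1.0)  # close to 0.6827
```

The integration limits of ±10 are an assumption that is safe for the standard normal, where the tails beyond that range carry negligible mass.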
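Gaussian Distribution Multivariate normal distribution: the normal distribution generalizes to R^n, where it is parametrized by a mean vector µ and a positive definite symmetric covariance matrix Σ.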
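Exponential distribution In the context of deep learning, we often want to have a probability distribution with a sharp point at x = 0. The exponential distribution provides this: p(x; λ) = λ exp(−λx) for x ≥ 0, and p(x; λ) = 0 for x < 0.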
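A sketch of the exponential density with rate lam, together with a sampling check that uses random.expovariate from the Python standard library. The rate value 2.0 and the sample size are made up for illustration.

```python
import math
import random

def exponential_pdf(x, lam):
    """Density lam * exp(-lam * x) for x >= 0, and 0 for negative x."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

lam = 2.0  # hypothetical rate parameter
rng = random.Random(0)  # fixed seed so the run is repeatable

samples = [rng.expovariate(lam) for _ in range(50_000)]
sample_mean = sum(samples) / len(samples)  # should be near 1 / lam
```

The density is largest exactly at x = 0 and drops to zero immediately to the left of it, which is the sharp point the slide refers to.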
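Laplace distribution A closely related distribution that places a sharp peak of probability mass at an arbitrary point µ: Laplace(x; µ, γ) = (1 / (2γ)) exp(−|x − µ| / γ).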
Mixtures of Distributions A mixture distribution is made up of several component distributions.
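A common way to draw from a mixture is to first pick a component at random according to its weight and then sample from that component. The sketch below assumes two Gaussian components with made-up weights, means and standard deviations; the function name sample_mixture is hypothetical.

```python
import random

# Hypothetical components: (mixing weight, mean, standard deviation).
components = [(0.3, -2.0, 0.5), (0.7, 3.0, 1.0)]

def sample_mixture(rng):
    """Pick a component by weight, then draw from that component's Gaussian."""
    weights = [w for w, _, _ in components]
    _, mu, sigma = rng.choices(components, weights=weights, k=1)[0]
    return rng.gauss(mu, sigma)

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_mixture(rng) for _ in range(20_000)]

sample_mean = sum(samples) / len(samples)
# The mixture mean is the weighted average of the component means.
expected_mean = sum(w * mu for w, mu, _ in components)
```

Gaussian components are only one choice here; any distributions that can be sampled would fit the same two-step scheme.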
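Bayes’ Rule When we know P(y | x) and need P(x | y), and we also know P(x), we can compute P(x | y) = P(x) P(y | x) / P(y). P(y) does not need to be known in advance, since P(y) = Σ_x P(y | x) P(x).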
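A sketch of how a prior and a likelihood combine into a posterior. The diagnostic-test scenario and all numbers below are made up for illustration, and the function name posterior is hypothetical.

```python
from fractions import Fraction

# Hypothetical prior P(x) over two hypotheses.
prior = {"disease": Fraction(1, 100), "healthy": Fraction(99, 100)}

# Hypothetical likelihood P(y = positive | x) for each hypothesis.
likelihood_pos = {"disease": Fraction(95, 100), "healthy": Fraction(5, 100)}

def posterior(prior, likelihood):
    """P(x | y) = P(x) * P(y | x) / P(y), with P(y) = sum_x P(y | x) * P(x)."""
    evidence = sum(prior[h] * likelihood[h] for h in prior)
    return {h: prior[h] * likelihood[h] / evidence for h in prior}

post = posterior(prior, likelihood_pos)
```

Even with a fairly accurate test, the posterior probability of the rare hypothesis stays modest because the prior is so small.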