Formal Computational Skills Lecture 9: Probability.

Summary By the end you will be able to: calculate the probability of an event; calculate conditional probabilities with Bayes' Theorem; use probability distributions and probability density functions to calculate probabilities; calculate the entropy of a system (easier than people think). Motivation: probability is used throughout Alife/AI: classification, data mining, stochastic optimisation (eg GAs), evolutionary theory, fuzzy systems, etc.

Probability Definition There are 2 schools of thought: Frequentist: probability is the proportion of times an event would occur if the conditions were repeated many times. Subjectivist: probability is the level of belief a rational being has that an event will occur. Which school one adheres to is not really an issue here. We will use: the probability of an event A (written as P(A)) is the number of ways A can happen divided by the total number of possible outcomes. It is often calculated by empirical experiment (counting events).

Eg suppose we throw 2 dice and define a random variable A = sum of the 2 dice. What is the probability that A is 10, written as P(A=10)?

[Table: the 36 equally likely outcomes (die 1, die 2), one row of 6 outcomes per value of one die; each row has probability 1/6, so each outcome has probability 1/36.]

P(A=10) = number of ways we can get 10 / number of outcomes. We get 10 if we get {5,5}, {6,4} or {4,6}, and there are 36 possible outcomes, so: P(A=10) = 3/36 = 1/12
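This can be checked by brute-force enumeration of the sample space; a minimal Python sketch (variable names are illustrative only):

```python
# Sketch: P(A = 10) for the sum of two dice, by enumerating all 36 outcomes.
from itertools import product
from fractions import Fraction

outcomes = list(product(range(1, 7), repeat=2))    # all (die1, die2) pairs
ways = sum(1 for d1, d2 in outcomes if d1 + d2 == 10)
print(Fraction(ways, len(outcomes)))               # 1/12, ie 3/36
```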

If 2 events are independent, the outcome of one has no bearing on the outcome of the other (eg tossing 2 dice). The probability of 2 independent events A and B both happening is: P(A,B) = P(A) P(B) eg for two dice, P(1, 3) = 1/6 x 1/6 = 1/36
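A sketch verifying this by counting, using the same enumeration as above:

```python
# Sketch: check P(A=1, B=3) = P(A=1) * P(B=3) for two fair dice.
from itertools import product
from fractions import Fraction

outcomes = list(product(range(1, 7), repeat=2))
n = len(outcomes)
p_a = Fraction(sum(1 for d1, _ in outcomes if d1 == 1), n)    # 1/6
p_b = Fraction(sum(1 for _, d2 in outcomes if d2 == 3), n)    # 1/6
p_ab = Fraction(sum(1 for d1, d2 in outcomes if d1 == 1 and d2 == 3), n)
assert p_ab == p_a * p_b   # 1/36: independence holds
```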

[Venn diagram: overlapping events A and B, with intersection 'A and B'.] The probability of A or B happening is: P(A or B) = P(A) + P(B) - P(A,B) If events can't happen together they are mutually exclusive (eg P(B=1, B=2) = 0), and we get the probability of one or the other simply by adding the probabilities of each event, eg we get the marginal probabilities (the prob. of each event separately), such as P(A=1), by adding 6 lots of 1/36 = 1/6. If the events are not mutually exclusive, we must take away P(A,B) or it would be added twice, eg for two dice: P(first = 2 or second = 4) = 1/6 + 1/6 - 1/36 = 11/36
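A quick check of the inclusion-exclusion rule by enumeration (the choice of events is just the slide's example):

```python
# Sketch: P(A or B) = P(A) + P(B) - P(A,B),
# with A = "first die shows 2", B = "second die shows 4".
from itertools import product
from fractions import Fraction

outcomes = list(product(range(1, 7), repeat=2))
n = len(outcomes)
p_or = Fraction(sum(1 for d1, d2 in outcomes if d1 == 2 or d2 == 4), n)
assert p_or == Fraction(1, 6) + Fraction(1, 6) - Fraction(1, 36)   # 11/36
```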

Conditional probabilities What if we want P(A=10), where A is the sum of 2 dice, but we have already thrown one die? We use the conditional probability P(A|B), where B represents throwing the first die: the probability of an event A happening given that event B has already happened. Read P(A|B) as 'probability of A given B'. Eg if we have already got a 5, the probability that the sum is 10 is: P(A=10|B=5) = 1/6 as we must now get a 5 with the second die.
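The same answer falls out of restricting the sample space to outcomes consistent with B; a small sketch:

```python
# Sketch: P(A=10 | B=5) by conditioning, ie keeping only outcomes
# where the first die shows 5.
from itertools import product
from fractions import Fraction

outcomes = list(product(range(1, 7), repeat=2))
given_b = [(d1, d2) for d1, d2 in outcomes if d1 == 5]   # first die was a 5
p = Fraction(sum(1 for d1, d2 in given_b if d1 + d2 == 10), len(given_b))
print(p)   # 1/6
```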

Eg data classification: classify a person's sex from their height. [Histogram: frequency of males (m) and females (f) by height bin (cm), with a query point x at h = 167.] P(m)? 7 out of the 12 people in the sample are male, so P(m) = 7/12 and P(f) = 5/12. Similarly, P(h=167) = 3/12 (3 of the 12 fall in the same height bin as x). What sex is x? If we didn't know the height we would go for male, as from our survey this is more likely: P(m) > P(f). However we know h = 167: P(m|h=167) = 1/3 < P(f|h=167), so go for female.

Bayes' Theorem A very (very) useful theorem: P(A|B) = P(B|A) P(A) / P(B) eg P(m|h=167) = P(h=167|m) P(m) / P(h=167) = (1/7 x 7/12) / (3/12) = 1/3 It is used since it can be easier to evaluate P(B|A) (eg here we don't need to know anything about females' heights).
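A minimal sketch of this calculation, assuming the counts implied by the slide (1 of the 7 males falls in x's height bin):

```python
# Sketch: Bayes' theorem for the height example.
from fractions import Fraction

p_m = Fraction(7, 12)                  # prior P(m)
p_h = Fraction(3, 12)                  # evidence P(h = x's bin)
p_h_given_m = Fraction(1, 7)           # likelihood P(h | m)

p_m_given_h = p_h_given_m * p_m / p_h  # posterior via Bayes' theorem
print(p_m_given_h)                     # 1/3, so P(f|h) = 2/3: classify as female
```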

Note that for classification we just need to know whether P(m|h) > P(f|h), so we can compare P(h|m) P(m) with P(h|f) P(f), as P(h) just normalises the probabilities. P(m) is known as the prior probability (or just the prior), as it is the probability prior to getting any evidence/data; P(m|h) is known as the posterior probability. The biggest problem in Bayesian inference is estimating the priors correctly, and often they are simply assumed to be equal, eg P(m) = P(f) = 1/2. Priors are very important as they encapsulate our prior knowledge of the problem and can have a very big impact on the outcome. Often priors are estimated iteratively (eg in SLAM), starting with no assumptions and incorporating evidence at each time-step.

Probability Distributions Suppose we have a 2d binary random variable X (random variables are written with a capital letter) where each dimension has equal probability of being 0 or 1. What are the different states (written with a small letter, ie x)? What are the probabilities of each state (written as P(X=x))?

B1  B2  X=(B1,B2)  P(B1,B2)
0   0   (0,0)      0.5 x 0.5 = 0.25
0   1   (0,1)      0.25
1   0   (1,0)      0.25
1   1   (1,1)      0.25

The 3rd and 4th columns form the probability distribution of X.

A probability distribution tells us the probability of each state of X occurring; where possible it is given as a function. Eg suppose Y = B1+B2. What are the different states of Y? What is the probability distribution?

B1  B2  Y=B1+B2  P(B1,B2)
0   0   ?        0.25
0   1   ?        0.25
1   0   ?        0.25
1   1   ?        0.25

Eg suppose Y = B1+B2. What are the different states?

B1  B2  Y=B1+B2  P(B1,B2)
0   0   0        0.25
0   1   1        0.25
1   0   1        0.25
1   1   2        0.25

so the probability distribution of Y is:

Y=B1+B2  P(Y=y)
0        0.25
1        0.5
2        0.25
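The same distribution can be built by counting states in code; a small sketch:

```python
# Sketch: the distribution of Y = B1 + B2 for two fair bits.
from itertools import product
from collections import Counter

states = list(product([0, 1], repeat=2))          # the four (B1, B2) states
counts = Counter(b1 + b2 for b1, b2 in states)    # ways of reaching each Y
p_y = {y: n / len(states) for y, n in sorted(counts.items())}
print(p_y)   # {0: 0.25, 1: 0.5, 2: 0.25}
```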

We often view probability distributions graphically: [Bar charts: probability distribution of X, states (0,0), (0,1), (1,0), (1,1) each with P(X=x) = 0.25; and probability distribution of Y, states 0, 1, 2 with P(Y=y) = 0.25, 0.5, 0.25.] Note that the probabilities sum to 1.

Also, if we make each state (ie bar) width one and put them together, we can draw the distribution as a graph. [Graph: the bars of P(Y=y) for Y = 0, 1, 2 placed side by side, each of width 1.] The probability of one or many states is the area of the bars, ie the area under the graph, eg P(Y = 1 or 2) = 0.5 + 0.25 = 0.75

Probability Density Functions In the previous examples both X and Y are discrete random variables, as their states are discrete. What about continuous variables, such as people's heights? Again we use a function, but a continuous one, known as a probability density function (pdf). The value f(x) indicates the likelihood of each height, eg 175 is the commonest height.

As with discrete variables, the area under the curve gives the probability, so if the pdf is f(x), then P(a <= x <= b) is the integral of f(x) from a to b. However, the probability of anyone's height being exactly 175 is 0. Instead we get the probability that the height is in a range of values, eg between 173 and 175. Again, the total area under the curve must sum to 1.
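A sketch of such a range probability, assuming (purely for illustration) that heights are gaussian with mean 175 cm and standard deviation 10 cm:

```python
# Sketch: P(173 <= height <= 175) under an assumed gaussian height pdf.
from math import erf, sqrt

def gaussian_cdf(x, mu, sigma):
    """Cumulative probability P(X <= x) for N(mu, sigma^2)."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

mu, sigma = 175.0, 10.0   # illustrative guesses, not from the lecture
p = gaussian_cdf(175, mu, sigma) - gaussian_cdf(173, mu, sigma)
print(p)   # area under the pdf between 173 and 175, ~0.079
```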

Properties/Parameters of Distributions Distributions and pdfs are often described/specified by parameters, most commonly the mean and variance. The mean is the average value and is also known as the expected value E(X). The variance governs the spread of the data and is the square of the standard deviation. For n-dimensional data we also have the n x n covariance matrix, where the ij'th element specifies the covariance between the i'th and j'th dimensions.
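Estimating these parameters from data is routine, eg with numpy; a toy sketch (the sample is synthetic):

```python
# Sketch: estimating mean, variance and covariance from a 2-d sample.
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(1000, 2)) * [1.0, 2.0]   # toy 2-d data
mean = data.mean(axis=0)            # per-dimension expected value E(X)
var = data.var(axis=0)              # per-dimension variance (std dev squared)
cov = np.cov(data, rowvar=False)    # 2x2 covariance matrix
print(mean, var, cov, sep="\n")
```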

Common Distributions We have met the 2 most common: the uniform and the gaussian (or normal) distribution. These are often used as the mutation operators in a GA, hill climbing etc. Other distributions include the binomial, multinomial, poisson etc. The uniform distribution has the same probability for each point, so the probability is governed by the range of the data R, and the pdf is f(x) = 1/R, eg 1/10 = 0.1 for a range of 10.

The gaussian is governed by the mean μ and variance σ², with pdf: f(x) = (1/√(2πσ²)) exp(-(x - μ)²/(2σ²)) It is centred on μ, and its width (and height) are governed by σ². [Figure: two gaussians with the same mean; the red curve has σ² = 1, the blue has σ² = 4.]
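The pdf is easy to evaluate directly; a short sketch showing how variance trades height for width (function name is illustrative):

```python
# Sketch: the gaussian pdf, f(x) = 1/sqrt(2*pi*var) * exp(-(x-mu)^2 / (2*var)).
from math import exp, pi, sqrt

def gaussian_pdf(x, mu, var):
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

print(gaussian_pdf(0, 0, 1))   # peak height ~0.399 for variance 1
print(gaussian_pdf(0, 0, 4))   # lower, wider curve: peak ~0.199 for variance 4
```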

Properties of the Gaussian The gaussian may look nasty but has lots of nice properties, including: the sum of gaussian random variables is gaussian; conditional densities of a gaussian are gaussian; a linear transformation of a gaussian is gaussian; it is easy to work with mathematically; for a given mean and variance, entropy (see later) is maximised by the gaussian distribution; Central Limit Theorem: the sum/mean of m uniform random variables tends to a gaussian as m grows.
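A quick numerical illustration of the Central Limit Theorem (the sample sizes are arbitrary choices):

```python
# Sketch: the mean of m uniform variables looks increasingly gaussian.
import numpy as np

rng = np.random.default_rng(0)
m, trials = 30, 100_000
means = rng.uniform(0, 1, size=(trials, m)).mean(axis=1)
# Sample mean and std should approach 0.5 and sqrt(1/12) / sqrt(m) ~ 0.0527
print(means.mean(), means.std())
hist, _ = np.histogram(means, bins=20)   # counts form a rough bell shape
print(hist)
```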

Entropy Doesn’t depend on values of X but how probability is spread Entropy can be formulated in many ways but can be seen as a measure of smoothness of a distribution For a discrete distribution, entropy is: and for continuous: Heavily used in information theory: the average amount of information needed to specify a variable’s value is the entropy

Entropy can also be formulated as a degree of surprise. Suppose the probability of one state is 1 and of all others is 0. The entropy is 0, as log(1) = 0 and 0 log(0) = 0 (by convention). Intuitively this makes sense: the event MUST happen, so no information is passed by specifying it, and similarly there is no surprise in hearing it has happened. Conversely, if the variable is entirely random, the entropy is large, as we must specify the value fully. If the log is to base e, information is measured in nats; if the log is to base 2, information is measured in bits.

If X is a 1d binary variable, how many bits of information are needed to specify it? Ans: 1. What is the entropy? Remember that log2(2^-n) = -n.

X  P(X=x)       log2(P(X=x))
0  0.5 = 2^-1   -1
1  0.5 = 2^-1   -1

So S(X) = -(0.5(-1) + 0.5(-1)) = 1

If X is a 2D binary variable, how many bits are needed to specify each state? Ans: 2. What is the entropy? Remember that log2(2^-n) = -n.

B1  B2  X=(B1,B2)  P(B1,B2)
0   0   (0,0)      0.25 = 2^-2
0   1   (0,1)      0.25 = 2^-2
1   0   (1,0)      0.25 = 2^-2
1   1   (1,1)      0.25 = 2^-2

S(X) = -(0.25(-2) + 0.25(-2) + 0.25(-2) + 0.25(-2)) = -4(0.25(-2)) = 2

For an nD binary variable we have 2^n states. What is the probability of each? Ans: 2^-n. What about the entropy / number of bits to specify a state? S(X) = -2^n (2^-n (-n)) = n

What about Y = B1+B2? Will the entropy of Y be more or less than that of X? What is the entropy?

Y=B1+B2  P(Y=y)
0        0.25 = 2^-2
1        0.5 = 2^-1
2        0.25 = 2^-2

Ans: S(Y) = -(0.25(-2) + 0.5(-1) + 0.25(-2)) = 1.5 Less entropy than X, as there is more order: Y is more predictable, eg if you had to guess Y you would say 1 and have a 50:50 chance of being right, whereas for the 2D random binary X there is only a 25% chance.
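Checking the worked values with the entropy sketch from earlier (redefined here so the snippet runs on its own):

```python
# Sketch: machine check of the worked entropy values.
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))           # 1.0 bit:  1d binary X
print(entropy([0.25] * 4))           # 2.0 bits: 2D binary X
print(entropy([0.25, 0.5, 0.25]))    # 1.5 bits: Y = B1+B2, more predictable
```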

Which of X and Y will have the higher entropy? [Figure: two distributions X and Y over the same states, with tables of P(X=x), log2(P(X=x)), P(Y=y) and log2(P(Y=y)).] Entropy of Y = 2.12. Notice that the steeper distribution has less entropy.

For a given mean and variance, entropy is maximised by the gaussian distribution. [Figure: a gaussian compared with a steeper 'super-gaussian' of the same mean and variance.]