
Introduction to Bayesian statistics Yves Moreau

Overview: the Cox-Jaynes axioms, Bayes’ rule, probabilistic models, maximum likelihood, maximum a posteriori, Bayesian inference, multinomial and Dirichlet distributions, estimation of frequency matrices, pseudocounts, Dirichlet mixtures.

The Cox-Jaynes axioms and Bayes’ rule

Probability vs. belief. What is a probability? Frequentist point of view: probabilities are defined by frequency counts (coin, die) and histograms (height of people); such definitions are somewhat circular because of their dependency on the Central Limit Theorem. Measure-theoretic point of view: probabilities satisfy Kolmogorov’s σ-algebra axioms; this rigorous definition fits well within measure and integration theory, but it is ad hoc in the sense that it is designed to fit that framework.

Bayesian point of view: probabilities are models of the uncertainty regarding propositions within a given domain. Induction vs. deduction. Deduction: IF (A ⇒ B AND A = TRUE) THEN B = TRUE. Induction: IF (A ⇒ B AND B = TRUE) THEN A becomes more plausible. Probabilities satisfy Bayes’ rule.

The Cox-Jaynes axioms. The Cox-Jaynes axioms allow the construction of a large probabilistic framework from minimal assumptions. First, some concepts. A is a proposition: A is TRUE or FALSE. D is a domain: the information available about the current situation. BELIEF: π(A = TRUE | D) is the belief we have regarding the proposition A given the domain knowledge D.

Second, some assumptions. 1. Suppose we can compare beliefs: π(A|D) > π(B|D) means that A is more plausible than B given D, and suppose the comparison is transitive. We then have an ordering relation, so belief can be represented by a number.

2. Suppose there exists a fixed relation between the belief in a proposition and the belief in the negation of this proposition, i.e., between π(A|D) and π(¬A|D). 3. Suppose there exists a fixed relation between, on the one hand, the belief in the conjunction of two propositions and, on the other hand, the belief in the first proposition and the belief in the second proposition given the first one, i.e., between π(A, B|D), π(A|D), and π(B|A, D).

Bayes’ rule. THEN it can be shown (after rescaling of the beliefs) that the beliefs behave as probabilities and satisfy Bayes’ rule: P(A|B, D) = P(B|A, D) P(A|D) / P(B|D). If we accept the Cox-Jaynes axioms, we can always apply Bayes’ rule, independently of the specific definition of the probabilities.

Bayes’ rule. Bayes’ rule will be our main tool for building probabilistic models and estimating them. Bayes’ rule holds not only for statements (TRUE/FALSE) but for any random variables (discrete or continuous). Bayes’ rule holds for specific realizations of the random variables as well as for whole distributions.

Importance of the domain D. The domain D is a flexible concept that encapsulates the background information relevant to the problem, and it is important to set up the problem within the right domain. Example: diagnosis of Tay-Sachs disease, a rare disease that appears more frequently among Ashkenazi Jews. With the same symptoms, the probability of the disease will be smaller in a hospital in Brussels than in Mount Sinai Hospital in New York. A model built from all the patients in the world will not be effective.
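As a numerical illustration of the role of the domain, the sketch below applies Bayes’ rule with the same symptom likelihoods but two different priors (disease prevalences); all numbers are hypothetical and are not taken from the slides.

```python
def posterior(prior, p_sympt_given_disease=0.95, p_sympt_given_healthy=0.01):
    """P(disease | symptoms) via Bayes' rule, for a given prior prevalence."""
    num = p_sympt_given_disease * prior
    return num / (num + p_sympt_given_healthy * (1 - prior))

print(posterior(prior=1e-5))  # low-prevalence domain: posterior stays below 0.1%
print(posterior(prior=1e-3))  # high-prevalence domain: posterior rises to roughly 9%
```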

Probabilistic models and inference

Probabilistic models. We have a domain D, we have observations (data) D, and we have a model M with parameters θ. Example 1. Domain D: the genome of a given organism. Data D: a DNA sequence S = ’ACCTGATCACCCT’. Model M: the sequences are generated by a discrete distribution over the alphabet {A, C, G, T}. Parameters θ: the symbol probabilities p_A, p_C, p_G, p_T (with p_A + p_C + p_G + p_T = 1).

Example 2. Domain D: all European people. Data D: the heights of people from a given group. Model M: height is normally distributed, N(m, σ). Parameters θ: the mean m and the standard deviation σ.

Generative models. It is often possible to set up a model of the likelihood of the data. For example, for the DNA sequence above, P(D|θ, M) = ∏_i p_{S_i} = p_A^{n_A} p_C^{n_C} p_G^{n_G} p_T^{n_T}, where n_X is the number of occurrences of symbol X in the sequence. More sophisticated models are possible: HMMs, Gibbs sampling for motif finding, Bayesian networks. We want to find the model that best describes our observations.
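A minimal sketch of this likelihood computation, assuming the i.i.d. discrete model above; the parameter values are made up for illustration.

```python
from collections import Counter
from math import log

def sequence_log_likelihood(seq, theta):
    """Log-likelihood of a DNA sequence under an i.i.d. discrete model:
    sum over symbols of n_X * log(p_X)."""
    counts = Counter(seq)
    return sum(n * log(theta[x]) for x, n in counts.items())

theta = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}   # hypothetical parameter values
print(sequence_log_likelihood("ACCTGATCACCCT", theta))
```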

Maximum likelihood. The maximum likelihood (ML) estimate is θ_ML = argmax_θ P(D|θ, M). ML is consistent: if the observations were generated by the model M with parameters θ*, then θ_ML converges to θ* as the number of observations goes to infinity. Note that the data might not be generated by any instance of the model. If the data set is small, there might be a large difference between θ_ML and θ*.
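For the DNA model, the ML estimate reduces to the observed symbol frequencies (proved later in the frequency-matrix section); a small sketch:

```python
from collections import Counter

def ml_estimate(seq, alphabet="ACGT"):
    """Maximum likelihood estimate of the symbol probabilities: p_X = n_X / N."""
    counts = Counter(seq)
    return {x: counts.get(x, 0) / len(seq) for x in alphabet}

print(ml_estimate("ACCTGATCACCCT"))  # C occurs 6 times out of 13, so p_C = 6/13
```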

Maximum a posteriori probability. The maximum a posteriori (MAP) estimate is θ_MAP = argmax_θ P(θ|D, M). By Bayes’ rule, P(θ|D, M) = P(D|θ, M) P(θ|M) / P(D|M). Thus the posterior is proportional to the likelihood of the data times the prior (the a priori knowledge); the evidence P(D|M) plays no role in the optimization over θ.
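The sketch below illustrates "posterior ∝ likelihood × prior" on a grid for a single coin-bias parameter; the prior and the coin-flip counts are invented for the example.

```python
import numpy as np

theta = np.linspace(0.001, 0.999, 999)   # grid over the parameter
n_heads, n_tails = 7, 3                  # hypothetical observations

prior = theta * (1 - theta)                        # unnormalized Beta(2, 2)-style prior
likelihood = theta**n_heads * (1 - theta)**n_tails
posterior = prior * likelihood                     # P(D|M) is constant, so it is ignored
posterior /= posterior.sum()                       # discrete normalization over the grid

theta_ml = theta[np.argmax(likelihood)]   # 0.7
theta_map = theta[np.argmax(posterior)]   # pulled toward 0.5 by the prior (about 0.67)
print(theta_ml, theta_map)
```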

Posterior mean estimate: θ_PM = E[θ | D, M] = ∫ θ P(θ|D, M) dθ.

Distributions over parameters. Let us look carefully at P(θ|M) (or at P(θ|D, M)). P(θ|M) is a probability distribution over the PARAMETERS. We have to handle distributions over observations and distributions over parameters at the same time. Example: distribution of the height of people. The likelihood P(D|θ, M) is a distribution over height; the prior P(θ|M) is a distribution over the mean height and the standard deviation of height.

Bayesian inference. If we want to update the probability of the parameters with new observations D: (1) choose a reasonable prior P(θ|M); (2) add the information from the data through the likelihood P(D|θ, M); (3) obtain the updated distribution of the parameters, the posterior P(θ|D, M). (We often work with logarithms.)

Bayesian inference. Example (slide figure): prior and posterior distributions over the mean height, for Belgian men and for a sample of 100 Dutch men.
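A minimal sketch of such an update, assuming a Gaussian likelihood with known noise variance and a Gaussian prior on the mean (the conjugate case); all numbers are made up.

```python
import numpy as np

def update_gaussian_mean(prior_mean, prior_var, data, noise_var):
    """Conjugate update of a Gaussian prior on the mean (known noise variance):
    posterior precision = prior precision + n / noise_var,
    posterior mean = precision-weighted average of prior mean and data."""
    n = len(data)
    post_prec = 1.0 / prior_var + n / noise_var
    post_mean = (prior_mean / prior_var + np.sum(data) / noise_var) / post_prec
    return post_mean, 1.0 / post_prec

rng = np.random.default_rng(0)
heights = rng.normal(loc=181.0, scale=7.0, size=100)   # simulated "100 Dutch men" (cm)
print(update_gaussian_mean(prior_mean=175.0, prior_var=25.0,
                           data=heights, noise_var=49.0))
```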

Marginalization. A major technique for working with probabilistic models is to introduce or remove a variable through marginalization wherever appropriate. If a variable Y can take only k mutually exclusive outcomes y_1, …, y_k, we have P(X) = Σ_{i=1..k} P(X, Y = y_i). If the variables are continuous, p(x) = ∫ p(x, y) dy.
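For discrete variables this is just summing a joint probability table over one axis; a small sketch with invented numbers:

```python
import numpy as np

# A made-up joint distribution P(X, Y) over 2 x 3 outcomes (rows: X, columns: Y).
joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.15, 0.20]])

p_x = joint.sum(axis=1)   # marginalize out Y: P(X) = sum_y P(X, y)
p_y = joint.sum(axis=0)   # marginalize out X: P(Y) = sum_x P(x, Y)
print(p_x, p_y)           # each marginal sums to 1
```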

Multinomial and Dirichlet distributions

Multinomial distribution. A discrete distribution over K possible outcomes with probabilities θ_i. Examples: a die, K = 6; a DNA sequence, K = 4; an amino acid sequence, K = 20. For K = 2 we have a Bernoulli variable (giving rise to a binomial distribution).

The multinomial distribution gives the probability of the number of times that the different outcomes were observed: P(n_1, …, n_K | θ) = (N! / (n_1! ⋯ n_K!)) ∏_i θ_i^{n_i}, with N = n_1 + … + n_K. The multinomial distribution is the natural distribution for the modeling of biological sequences.
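A direct transcription of this formula (the function name is mine):

```python
from math import factorial, prod

def multinomial_pmf(counts, theta):
    """P(n_1, ..., n_K | theta): probability of these counts in N = sum(counts) draws."""
    coeff = factorial(sum(counts)) // prod(factorial(c) for c in counts)
    return coeff * prod(t**c for t, c in zip(theta, counts))

# Counts of A, C, G, T in 'ACCTGATCACCCT' are (3, 6, 1, 3); uniform theta for illustration.
print(multinomial_pmf([3, 6, 1, 3], [0.25, 0.25, 0.25, 0.25]))
```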

Dirichlet distribution. A distribution over the region of the parameter space where θ_i ≥ 0 and θ_1 + … + θ_K = 1 (the probability simplex). The distribution has parameters α = (α_1, …, α_K). The Dirichlet distribution gives the probability of the parameter vector θ; it behaves like a ‘dice factory’: each draw from it is itself a set of outcome probabilities.

Dirichlet distribution. D(θ|α) = (1/Z(α)) ∏_i θ_i^{α_i − 1}, where Z(α) is a normalization factor such that the distribution integrates to 1: Z(α) = ∏_i Γ(α_i) / Γ(Σ_i α_i). Γ is the gamma function, the generalization of the factorial function to real numbers. The Dirichlet distribution is the natural prior for sequence analysis because it is conjugate to the multinomial distribution: if we have a Dirichlet prior and we update this prior with multinomial observations, the posterior also has the form of a Dirichlet distribution. This is computationally very attractive.
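A sketch of this conjugacy: adding the observed counts to the prior parameters gives the posterior Dirichlet, from which parameter vectors can be sampled; the prior and the counts (those of the example DNA sequence) are chosen for illustration.

```python
import numpy as np

alpha_prior = np.array([1.0, 1.0, 1.0, 1.0])   # uniform Dirichlet prior over {A, C, G, T}
counts = np.array([3, 6, 1, 3])                # symbol counts from 'ACCTGATCACCCT'
alpha_post = alpha_prior + counts              # conjugacy: posterior is Dirichlet(alpha + n)

rng = np.random.default_rng(0)
samples = rng.dirichlet(alpha_post, size=5)    # each row is a probability vector theta
print(alpha_post, samples.sum(axis=1))         # every sampled theta sums to 1
```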

Estimation of frequency matrices. Estimation on the basis of counts, e.g., the Position-Specific Scoring Matrix in PSI-BLAST. Example: matrix model of a local motif from the aligned sites GACGTG, CTCGAG, CGCGTG, AACGTG, CACGTG. Count the number of instances of each symbol in each column.

If there are many aligned sites (N >> 1), we can estimate the frequencies as θ_{i,b} = n_{i,b} / N, where n_{i,b} is the number of times symbol b appears in column i. This is the maximum likelihood estimate for θ.
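A sketch of the count matrix and the ML frequencies for the five aligned sites above:

```python
import numpy as np

sites = ["GACGTG", "CTCGAG", "CGCGTG", "AACGTG", "CACGTG"]
alphabet = "ACGT"

# Count matrix: rows are motif positions, columns are A, C, G, T.
counts = np.zeros((len(sites[0]), len(alphabet)), dtype=int)
for seq in sites:
    for i, base in enumerate(seq):
        counts[i, alphabet.index(base)] += 1

freqs = counts / len(sites)   # maximum likelihood estimate: n_ib / N
print(counts)
print(freqs)
```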

Proof. We want to show that the frequencies θ_i = n_i / N maximize the likelihood P(D|θ) = ∏_i θ_i^{n_i}. This is equivalent to maximizing the log-likelihood Σ_i n_i log θ_i subject to the constraint Σ_i θ_i = 1. Further, introducing a Lagrange multiplier λ for the constraint and setting the derivatives to zero gives n_i / θ_i = λ, i.e., θ_i = n_i / λ; the constraint then yields λ = Σ_i n_i = N, so θ_i = n_i / N.

Pseudocounts. If we have a limited number of counts, the maximum likelihood estimate will not be reliable (e.g., for symbols not observed in the data). In such a situation, we can combine the observations with prior knowledge. Suppose we use a Dirichlet prior D(θ|α); let us compute the Bayesian update.

Bayesian update. By Bayes’ rule, P(θ|D) ∝ P(D|θ) D(θ|α) ∝ ∏_i θ_i^{n_i} ∏_i θ_i^{α_i − 1} = ∏_i θ_i^{n_i + α_i − 1}, which is again a Dirichlet distribution, D(θ | n + α); the proportionality constant works out to 1 because both distributions are normalized. The posterior mean estimate then follows from the normalization integral Z(·): θ_i^{PM} = (n_i + α_i) / (N + Σ_j α_j).

Pseudocounts. The prior contributes to the estimation through pseudo-observations (the pseudocounts α_i). If few observations are available, the prior plays an important role; if many observations are available, the pseudocounts play a negligible role.
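A sketch of the effect on one column of the motif example (the pseudocount values are hypothetical):

```python
import numpy as np

n = np.array([1, 3, 1, 0])              # column counts for A, C, G, T: T never observed
alpha = np.array([1.0, 1.0, 1.0, 1.0])  # hypothetical uniform pseudocounts

theta_ml = n / n.sum()                      # ML estimate: assigns probability 0 to T
theta_pm = (n + alpha) / (n + alpha).sum()  # posterior mean: pseudocounts remove the zero
print(theta_ml, theta_pm)
```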

Dirichlet mixture. Sometimes the observations are generated by a heterogeneous process (e.g., hydrophobic vs. hydrophilic domains in proteins). In such situations, we should use different priors depending on the context, but we do not necessarily know the context beforehand. A possibility is to use a Dirichlet mixture: the frequency parameter θ can be generated from m different sources S_k with different Dirichlet parameters α^k.

Dirichlet mixture. The prior is a mixture P(θ) = Σ_k q_k D(θ|α^k) with mixture weights q_k. Via Bayes’ rule, the posterior is again a mixture of Dirichlet distributions: each component is updated to D(θ | α^k + n), and the component weights are updated in proportion to q_k times the likelihood of the counts under component k.

Dirichlet mixture. Posterior mean estimate: θ_i^{PM} = Σ_k P(k|D) (n_i + α_i^k) / (N + Σ_j α_j^k). The different components of the Dirichlet mixture are first treated as separate sets of pseudocounts; these component estimates are then combined with weights P(k|D) that depend on the likelihood of the data under each Dirichlet component.
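A minimal sketch of this estimate; the component parameters, weights, and helper names are mine, and the marginal likelihood of the counts under each component is computed from the ratio Z(α^k + n)/Z(α^k) (the multinomial coefficient cancels across components).

```python
import numpy as np
from scipy.special import gammaln

def log_z(a):
    """log Z(a) for a Dirichlet: sum(log Gamma(a_i)) - log Gamma(sum(a_i))."""
    return np.sum(gammaln(a)) - gammaln(np.sum(a))

def dirichlet_mixture_posterior_mean(n, alphas, weights):
    """Posterior mean of theta under a Dirichlet mixture prior."""
    n = np.asarray(n, dtype=float)
    # log P(k | D) up to a constant: log q_k + log Z(alpha_k + n) - log Z(alpha_k)
    log_post = np.array([np.log(q) + log_z(a + n) - log_z(a)
                         for q, a in zip(weights, alphas)])
    post_k = np.exp(log_post - log_post.max())
    post_k /= post_k.sum()
    # Per-component posterior mean (pseudocount estimate), then weighted combination.
    means = np.array([(n + a) / (n + a).sum() for a in alphas])
    return post_k @ means

# Two hypothetical components: one A/T-rich, one C/G-rich, with equal weights.
alphas = [np.array([2.0, 0.5, 0.5, 2.0]), np.array([0.5, 2.0, 2.0, 0.5])]
print(dirichlet_mixture_posterior_mean([1, 3, 1, 0], alphas, weights=[0.5, 0.5]))
```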

Summary: the Cox-Jaynes axioms, Bayes’ rule, probabilistic models, maximum likelihood, maximum a posteriori, Bayesian inference, multinomial and Dirichlet distributions, estimation of frequency matrices, pseudocounts, Dirichlet mixtures.