
Scales and probability measures

The states of a random variable can be given on different scales.

1) Nominal scale
A scale where the states have no numerical interrelationships.
Example: The colour of a sampled pill from a seizure of suspected illicit drug pills.
Each state can be assigned a probability > 0.

2) Numerical scale

a) Discrete states

(i) Ordinal scale
A scale where the states can be put in ascending order.
Example: Classification of a dental cavity as small, medium-sized or large.

Each state can be assigned a probability > 0.
Once probabilities have been assigned it is also meaningful to interpret statements such as "at most", "at least", "smaller than", "larger than"…
If we denote the random variable by X, assigned state probabilities are written Pr(X = a), and we can also interpret Pr(X ≤ a), Pr(X ≥ a), Pr(X < a) and Pr(X > a).

(ii) Interval scale
An ordinal scale where the distance between two consecutive states is the same no matter where in the scale we are.
Example: The number of gunshot residues found on the hands of a person suspected to have fired a gun. The distance between 5 and 4 is the same as the distance between 125 and 124.
Probabilities are assigned and interpreted the same way as for an ordinal scale.

Interval-scale discrete random variables very often fit into a family of discrete probability distributions, where the assignment consists of choosing one or several parameters.
Probabilities can then be written on parametric form using a probability mass function, e.g. if X denotes the random variable, Pr(X = x | θ).

Examples:

Binomial distribution: Pr(X = x) = [n! / (x!(n − x)!)] · θ^x · (1 − θ)^(n − x), x = 0, 1, …, n
Typical application: The number of "successes" out of n independent trials, where for each trial the assigned probability of success is θ.

Poisson distribution: Pr(X = x) = λ^x · e^(−λ) / x!, x = 0, 1, 2, …
Typical application: Count data, e.g. the number of times an event occurs in a fixed time period, where λ is the expected number of counts in that period.
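As an added illustration (not part of the original slides), these probability mass functions can be evaluated directly in R, the language used for the Beta-distribution calculation at the end of this section; the parameter values below are chosen only for the example.

# Binomial: probability of x = 3 successes in n = 10 trials with success probability theta = 0.2
dbinom(3, size = 10, prob = 0.2)
# Poisson: probability of x = 3 events when the expected count is lambda = 2
dpois(3, lambda = 2)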

b) Continuous states

(i) Interval scale
This scale is of the same kind as for discrete states.
Example: Daily temperature in Celsius degrees.
However, a probability > 0 cannot be assigned to a particular state. Instead, probabilities can be assigned to intervals of states. The whole range of states has probability one.
The probability of an interval of states depends on the assigned probability density function for the range of states.
Denote the random variable by X. It is thus only meaningful to assign probabilities like Pr(a < X < b) [which is equal to Pr(a ≤ X ≤ b)]. Such probabilities are obtained by integrating the assigned density function (see further below).

(ii) Ratio scale
An interval scale with a well-defined zero state.
Example: Measurements of weight and length.
The probability measure is the same as for continuous interval-scale random variables.

The probability density function and probabilities:
The random variable X is almost always assumed to belong to a family of continuous probability distributions. The density function is then specified on parametric form, f(x | θ), and probabilities for intervals of states are computed as integrals:
Pr(a < X < b) = ∫ from a to b of f(x | θ) dx

Examples:

1) Normal distribution, N(165, 6.4), as a proxy for the length of a randomly selected adult woman.
E.g. Pr(150 < X < 160) is calculated as the area under the density curve between 150 and 160 (i.e. an integral).
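A quick numerical sketch in R (not from the original slides), assuming the second parameter of N(165, 6.4) is the standard deviation:

# Pr(150 < X < 160) for X ~ N(mean = 165, sd = 6.4)
pnorm(160, mean = 165, sd = 6.4) - pnorm(150, mean = 165, sd = 6.4)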

2) Gamma distribution [Gamma(k, λ)] with k = 2 and λ = 4 (might be a proxy for the time until death for an organism).
E.g. the probability that the time exceeds 0.5 (for the scaling used) is Pr(X > 0.5) and is the area under the density curve from 0.5 to infinity.
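Correspondingly in R (an added sketch, assuming k is the shape parameter and λ = 4 is a rate rather than a scale parameter):

# Pr(X > 0.5) for X ~ Gamma(shape = 2, rate = 4)
1 - pgamma(0.5, shape = 2, rate = 4)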

Probability and Likelihood

Two synonyms? An event can be likely or probable, which for most people would be the same. Yet, the definitions of probability and likelihood are different. In a simplified form:

The probability of an event measures the degree of belief that this event is true, and is used for reasoning about not yet observed events.
The likelihood of an event is a measure of how likely that event is in light of another, observed event.

Both use probability calculus.

More formally…

Consider the unobserved event A and the observed event B. There are probabilities for both, representing the degrees of belief for these events in general: Pr(A | I) and Pr(B | I), where I denotes the background information.
However, as B is observed we might be interested in Pr(A | B, I), which measures the updated degree of belief that A is true once we know that B holds. Still a probability, though.
How interesting is Pr(B | A, I)?

Pr(B | A, I) might look meaningless to consider, as we have actually observed B. However, it says something about A. We have observed B, and if A is relevant for B we may compare Pr(B | A, I) with Pr(B | Ā, I). Now, even if we have not observed A or Ā, one of them must be true.
If Pr(B | A, I) > Pr(B | Ā, I) we may conclude that A is more likely to have occurred than Ā, or better phrased: "A is a better explanation of why B has occurred than is Ā".
Pr(B | A, I) is called the likelihood of A given the observed B (and Pr(B | Ā, I) is the likelihood of Ā).
Note! This is different from the conditional probability of A given B: Pr(A | B, I).

Extending…

The likelihood represents what the observed event(s) or data say about A.
The probability represents what the model says about A (with or without conditioning on data).

The likelihood need not be a strict probability expression. If the data consist of continuous measurements (interval or ratio scale), no distinct probability can be assigned to a specific value, but the specific value might be the event of interest. Instead, the randomness of such an event is measured through the probability density function f(x | I), where x (usually) stands for a specific value.

Example: Suppose you have observed a person at quite a long distance and you try to determine the sex of this person. Your observations are the following:

1) The person was short
2) The person's skull was shaved

Based on observation 1 only, your provisional conclusion would be that it is a woman. This is so because women in general are shorter than men. The likelihood for the event "It is a woman" is the density of women's lengths evaluated at the length of this person.

Based on observation 2 only, your provisional conclusion would be that it is a man. This is so because more men than women have shaved skulls. The likelihood here for the event "It is a woman" is the proportion of women that have shaved skulls.

Note that this is different from considering the proportion of women among those persons that have the same length as the person of interest (observation 1), and different from considering the proportion of women among persons with shaved skulls (observation 2); those would be conditional probabilities of "It is a woman" given the observations, not likelihoods.

What if we combine observations 1 and 2?

Provided we can assume that a person's length is not relevant for whether the person's skull is shaved or not, the likelihood for "It is a woman" in view of the combined observations is the product of the individual likelihoods (see the sketch below).
Note that it would be even more problematic to consider the proportion of women among those persons that have the same length as the person of interest and a shaved skull.
Combining the two observations might lead to a combined likelihood that is roughly equally large for both events ("It is a woman" and "It is a man").
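A small numerical sketch of this in R (not from the original slides; the heights, standard deviations and shaved-skull proportions below are invented purely for illustration):

obs_height <- 160   # hypothetical observed height in cm
# likelihood = density of heights evaluated at the observation, times proportion with shaved skulls
lik_woman <- dnorm(obs_height, mean = 165, sd = 7) * 0.02
lik_man   <- dnorm(obs_height, mean = 178, sd = 8) * 0.20
lik_woman / lik_man   # likelihood ratio "woman" versus "man"; close to 1 with these made-up numbers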

The general definition of likelihood:

Assume we have a number of unobserved events A, B, … and some observed data. The observed data can be one specific value (or state) of a variable, x, or a collection of values (states).

A probability model can be used that can either
- assign a distinct probability to the observed data, Pr(x | I). This is the case when there is an enumerable set of possible values or when the observed data is a continuous interval of values, or
- evaluate the density of the observed data, f(x | I). This is the case when x is a continuous variable.

The likelihood of A given the data is
L(A | data) = Pr(data | A, I), or f(data | A, I) in the continuous case.

The likelihood ratio of A versus B given the data is
LR = L(A | data) / L(B | data)

LR > 1 ⇒ A is a better explanation than B for the observed data.

Example: Return to the example with detection of dye on bank notes.

Unobserved event: A = "Dye is present"
Observed event (Data): B = "Method gives positive result"

Since Pr(B | A, I) > Pr(B | Ā, I) for this method, LR > 1 ⇒ a positive result makes the event "Dye is present" a better explanation than the event "Dye is absent".
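The original detection-method probabilities are not reproduced here, but a sketch with hypothetical values shows the calculation (the 0.99 and 0.05 below are invented, not the figures from the bank-note example):

p_pos_given_dye    <- 0.99   # hypothetical Pr(B | A, I): positive result when dye is present
p_pos_given_no_dye <- 0.05   # hypothetical Pr(B | A-bar, I): positive result when dye is absent
LR <- p_pos_given_dye / p_pos_given_no_dye
LR   # > 1, so "Dye is present" is the better explanation of a positive result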

Potential danger in mixing things up:

When we say that an event is the more likely one in light of data, we do not say that this event has the highest probability. Using the likelihood as a measure of how likely an event is, is a matter of inference to the best explanation.

Logic:
Implication: A ⇒ B means
If A is true then B is true, i.e. Pr(B | A, I) = 1
If B is false then A is false, i.e. Pr(Ā | B̄, I) = 1
If B is true we cannot say anything about whether A is true or not (implication is different from equivalence)

"Probabilistic implication":
If A is true then B may be true, i.e. Pr(B | A, I) > 0
If B is false then A may still be true, i.e. Pr(A | B̄, I) > 0
If B is true then we may decide which of A and Ā is the best explanation

Inference to the best explanation:
B is observed. A1, A2, …, Am are potential alternative explanations of B.
If Pr(B | Ak, I) > Pr(B | Aj, I) for each j ≠ k, then Ak is considered the best explanation for B and is provisionally accepted.

Bayesian hypothesis testing

In an inferential setup we may work with propositions or hypotheses. A hypothesis is a central component in the building of science, and forensic science is no exception. Successive falsification of hypotheses (cf. Popper) is an important component of crime investigation.

The "standard situation" would be that we have two hypotheses:
H0: The forwarded hypothesis
H1: The alternative hypothesis
These must be mutually exclusive.

Classical statistical hypothesis testing (Neyman J. and Pearson E.S., 1933)

The two hypotheses are different explanations of the Data. Each hypothesis provides model(s) for Data. The purpose is to use Data to try to falsify H0.

Type-I error: Falsifying a true H0
Type-II error: Not falsifying a false H0
Size or significance level: α = Pr(Type-I error)
Power: 1 − Pr(Type-II error) = 1 − β

If each hypothesis provides one and only one model for Data, the hypotheses are referred to as simple.

Most powerful test for simple hypotheses (Neyman-Pearson lemma):

Reject H0 whenever
L(H1 | Data) / L(H0 | Data) ≥ A,
where A > 0 is chosen so that Pr(Reject H0 | H0 is true) = α.

This test minimises β for fixed α. Note that the probability is taken with respect to Data, i.e. with respect to the probability model each hypothesis provides for Data.

Extension to composite hypotheses: Uniformly most powerful test (UMP)

Example: A seizure of pills, suspected to be Ecstasy, is sampled for the purpose of investigating whether the proportion of Ecstasy pills is "around" 80% or "around" 50%. In a sample of 50 pills, 39 proved to be Ecstasy pills.

As the forwarded hypothesis we can formulate
H0: Around 80% of the pills in the seizure are Ecstasy
and as the alternative hypothesis
H1: Around 50% of the pills in the seizure are Ecstasy

The likelihoods of the two hypotheses are
L(H0 | Data) = Probability of obtaining 39 Ecstasy pills out of 50 sampled when the seizure proportion of Ecstasy pills is 80%
L(H1 | Data) = Probability of obtaining 39 Ecstasy pills out of 50 sampled when the seizure proportion of Ecstasy pills is 50%

Assuming a large seizure, these probabilities can be calculated using a binomial model Bin(50, p), where H0 states that p = p0 = 0.8 and H1 states that p = p1 = 0.5. In generic form, if we have obtained x Ecstasy pills out of n sampled:
L(Hi | Data) = [n! / (x!(n − x)!)] · pi^x · (1 − pi)^(n − x), i = 0, 1
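In R these two likelihoods are single calls to the binomial probability mass function (an added sketch, not part of the slides):

L0 <- dbinom(39, size = 50, prob = 0.8)   # L(H0 | Data)
L1 <- dbinom(39, size = 50, prob = 0.5)   # L(H1 | Data)
c(L0, L1)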

The Neyman-Pearson lemma now states that the most powerful test is of the form
Reject H0 when L(H1 | Data) / L(H0 | Data) = (p1/p0)^x · ((1 − p1)/(1 − p0))^(n − x) ≥ A
Since p1 < p0, the left-hand side is decreasing in x. Hence, H0 should be rejected in favour of H1 as soon as x ≤ B.

How to choose B?

Normally, we would set the significance level α and then find B so that Pr(X ≤ B | H0) ≤ α.

If α is chosen to be 0.05 we can search the binomial distribution valid under H0 for a value B such that Pr(X ≤ B | H0) ≤ 0.05.

MS Excel: BINOM.INV(50;0.8;0.05) returns the lowest value for which the cumulative probability is at least 0.05 ⇒ 35.
BINOM.DIST(35;50;0.8;TRUE) ≥ 0.05, while BINOM.DIST(34;50;0.8;TRUE) < 0.05 ⇒ Choose B = 34.
Since x = 39 > 34 we cannot reject H0.
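The same search can be done in R (an added cross-check, using the language of the calculation at the end of this section):

qbinom(0.05, size = 50, prob = 0.8)   # smallest value whose cumulative probability under H0 is at least 0.05 (returns 35)
pbinom(35, size = 50, prob = 0.8)     # at least 0.05, so 35 is too large as a critical value
pbinom(34, size = 50, prob = 0.8)     # below 0.05, so choose B = 34; the observed x = 39 exceeds B, so H0 is not rejected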

Drawbacks with the classical approach

"Isolated" falsification (or no falsification) – tests using other data but with the same hypotheses cannot be easily combined.
Data alone "decides". Small amounts of data ⇒ low power.
Difficulties in interpretation: When H0 is rejected, it means "If we repeat the collection of data under (in principle) identical circumstances, then in (at most) 100·α % of all cases a true H0 would be rejected". Can we (always) repeat the collection of data?
"Falling off the cliff" – what is the difference between "just rejecting" and "almost rejecting"?

The Bayesian Approach

There is always a process that leads to the formulation of the hypotheses. ⇒ There exists a prior probability for each of them:
Pr(H0 | I) = p0 and Pr(H1 | I) = p1 = 1 − p0

More simply expressed as prior odds for the hypothesis H0:
Odds(H0) = p0 / p1

Non-informative priors: p0 = p1 = 0.5 gives prior odds = 1.

Data should help us calculate the posterior odds
q0 = Odds(H0 | Data) = Pr(H0 | Data, I) / Pr(H1 | Data, I)

The "hypothesis testing" is then a judgement upon whether q0 is
- small enough to make us believe in H1, or
- large enough to make us believe in H0,
i.e. no pre-setting of the decision direction is made.

The odds ratio (posterior odds / prior odds) is known as the Bayes factor:
B = q0 / Odds(H0)

How can we obtain the posterior odds? Posterior odds = Bayes factor × prior odds, i.e. q0 = B · Odds(H0). Hence, if we know the Bayes factor, we can calculate the posterior odds (since we can always set the prior odds).
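As a tiny numerical sketch (not from the slides; the values are arbitrary), the relations between prior odds, Bayes factor, posterior odds and posterior probability look like this in R:

prior_odds <- 1          # e.g. no particular prior belief in either hypothesis
B <- 10                  # an arbitrary Bayes factor in favour of H0
posterior_odds <- B * prior_odds
posterior_prob_H0 <- posterior_odds / (1 + posterior_odds)   # convert odds to probability
c(posterior_odds, posterior_prob_H0)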

1. Both hypotheses are simple, i.e. give one and only one model each for Data

a) Distinct probabilities can be assigned to Data
Bayes' theorem on odds form then gives
Pr(H0 | Data, I) / Pr(H1 | Data, I) = [Pr(Data | H0, I) / Pr(Data | H1, I)] · [Pr(H0 | I) / Pr(H1 | I)]
Hence, the Bayes factor is
B = Pr(Data | H0, I) / Pr(Data | H1, I)
The probabilities of the numerator and denominator respectively can be calculated (estimated) using the model provided by the respective hypothesis.

b) Data is the observed value x of a continuous (possibly multidimensional) random variable
It can be shown that
Pr(H0 | x, I) / Pr(H1 | x, I) = [f(x | H0, I) / f(x | H1, I)] · [Pr(H0 | I) / Pr(H1 | I)]
where f(x | H0, I) and f(x | H1, I) are the probability density functions given by the models specified by H0 and H1. Hence, the Bayes factor is
B = f(x | H0, I) / f(x | H1, I)
Known (or estimated) density functions under each model can then be used to calculate the Bayes factor.

In both cases we can see that the Bayes factor is a likelihood ratio, since the numerator and denominator are likelihoods for the respective hypotheses.

Example: Ecstasy pills revisited
The likelihoods for the hypotheses are
L(H0 | Data) = [50! / (39!·11!)] · 0.8^39 · 0.2^11
L(H1 | Data) = [50! / (39!·11!)] · 0.5^50
B = L(H0 | Data) / L(H1 | Data) ≈ 3831
Hence, Data are about 3831 times more probable if H0 is true compared to if H1 is true.

Assume we have no particular belief in any of the two hypotheses prior to obtaining the data, i.e. prior odds Odds(H0) = 1. The posterior odds are then
q0 = B · Odds(H0) ≈ 3831
and the posterior probability of H0 is q0 / (1 + q0) ≈ 0.9997.
Hence, upon the analysis of data we can be 99.97% certain that H0 is true (see the sketch below). Note, however, that it may be unrealistic to assume only two possible proportions of Ecstasy pills in the seizure!
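The whole calculation in R (an added sketch reproducing the numbers quoted above):

B <- dbinom(39, 50, 0.8) / dbinom(39, 50, 0.5)   # Bayes factor = likelihood ratio, approx. 3831
prior_odds <- 1                                   # no particular prior belief in either hypothesis
posterior_odds <- B * prior_odds
posterior_odds / (1 + posterior_odds)             # posterior probability of H0, approx. 0.9997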

2. The hypothesis H0 is simple but the hypothesis H1 is composite, i.e. it provides several models for Data (several explanations)

The various models of H1 would (in general) provide different likelihoods for the different explanations ⇒ we cannot come up with one unique likelihood for H1. If, in addition, the different explanations have different prior probabilities, we have to weigh the different likelihoods with these.

If the composition of H1 is in the form of a set of discrete alternatives, the Bayes factor can be written
B = Pr(Data | H0, I) / Σi Pr(Data | H1i, I) · P(H1i | H1)
where P(H1i | H1) is the conditional prior probability that H1i is true given that H1 is true (relative prior), and the sum is over all alternatives H11, H12, …

If the relative priors are (fairly) equal, the denominator reduces to the average likelihood of the alternatives. If the likelihoods of the alternatives are equal, the denominator reduces to that likelihood, since the relative priors sum to one.

If the composition is defined by a continuously valued parameter θ, we must use the conditional prior density of θ given that H1 is true, p(θ | H1), and integrate the likelihood with respect to that density. The Bayes factor can then be written
B = Pr(Data | H0, I) / ∫ f(Data | θ, I) · p(θ | H1) dθ
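A hypothetical sketch of this case in R (not from the slides): take H0: θ = 0.8 as the simple hypothesis, H1: θ uniform on [0, 0.8) as the composite one, and the same data of 39 Ecstasy pills out of 50.

num <- dbinom(39, 50, 0.8)                                              # Pr(Data | H0, I), H0 simple
den <- integrate(function(th) dbinom(39, 50, th) / 0.8, 0, 0.8)$value   # likelihood averaged over p(theta | H1) = 1/0.8 on [0, 0.8)
num / den                                                               # Bayes factor for H0 versus the composite H1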

3. Both hypotheses are composite, i.e. each provides several models for Data (several explanations)

This gives different sub-cases, depending on whether the compositions in the hypotheses are discrete or according to a continuously valued parameter.

The "discrete-discrete" case gives the Bayes factor
B = [Σi Pr(Data | H0i, I) · P(H0i | H0)] / [Σj Pr(Data | H1j, I) · P(H1j | H1)]
and the "continuous-continuous" case gives the Bayes factor
B = [∫ f(Data | θ, I) · p(θ | H0) dθ] / [∫ f(Data | θ, I) · p(θ | H1) dθ]
where p(θ | H0) is the conditional prior density of θ given that H0 is true.

Example: Ecstasy pills revisited again

Assume a more realistic case where, from a sample of the seizure, we shall investigate whether the proportion of Ecstasy pills is higher than 80%.
H0: Proportion θ > 0.8
H1: Proportion θ ≤ 0.8
i.e. both hypotheses are composite.

We further assume, as before, that we have no particular belief in any of the two hypotheses. The prior density for θ can thus be defined as flat (uniform) over the possible values of θ.

The likelihood function is (irrespective of the hypotheses)
L(θ | Data) = [50! / (39!·11!)] · θ^39 · (1 − θ)^11

The conditional prior densities under each hypothesis become uniform over each interval of potential values of θ ((0.8, 1] and [0, 0.8]). The Bayes factor is
B = [∫ from 0.8 to 1 of L(θ | Data) · p(θ | H0) dθ] / [∫ from 0 to 0.8 of L(θ | Data) · p(θ | H1) dθ]

How do we solve these integrals?

The Beta distribution: A random variable is said to have a Beta distribution with parameters a and b if its probability density function is
f(x) = [Γ(a + b) / (Γ(a)·Γ(b))] · x^(a − 1) · (1 − x)^(b − 1), 0 ≤ x ≤ 1

Hence, we can identify the integrals of the Bayes factor as proportional to different probabilities of the same Beta distribution, namely a Beta distribution with parameters a = 40 and b = 12 (since θ^39 · (1 − θ)^11 = θ^(40 − 1) · (1 − θ)^(12 − 1)).

> num <- 1 - pbeta(0.8, 40, 12)
> den <- pbeta(0.8, 40, 12)
> num
[1] …
> den
[1] …
> B <- num/den
> B
[1] …

Hence, the Bayes factor is the value of B above. With even prior odds (Odds(H0) = 1) we get posterior odds equal to the Bayes factor, and the posterior probability of H0 is B/(1 + B). ⇒ Data does not provide us with evidence clearly against any of the hypotheses.