Scales and probability measures
The states of a random variable can be given on different scales.
1) Nominal scale
A scale where the states have no numerical interrelationships.
Example: The colour of a sampled pill from a seizure of suspected illicit drug pills.
Each state can be assigned a probability > 0.
2) Numerical scale
a) Discrete states
(i) Ordinal scale
A scale where the states can be put in ascending order.
Example: Classification of a dental cavity as small, medium-sized or large.
Each state can be assigned a probability > 0.
Once probabilities have been assigned it is also meaningful to interpret statements such as “at most”, “at least”, “smaller than” and “larger than”.
If we denote the random variable by X, assigned state probabilities are written Pr(X = a), and we can also interpret probabilities such as Pr(X ≤ a), Pr(X ≥ a) and Pr(X < a).
(ii) Interval scale
An ordinal scale where the distance between two consecutive states is the same no matter where in the scale we are.
Example: The number of gunshot residues found on the hands of a person suspected of having fired a gun. The distance between 5 and 4 is the same as the distance between 125 and 124.
Probabilities are assigned and interpreted in the same way as for an ordinal scale.
Interval-scale discrete random variables very often fit into a family of discrete probability distributions, where the assignment consists of choosing one or several parameters.
Probabilities can be written in parametric form using a probability mass function, e.g. if X denotes the random variable and θ the parameter(s):
$$\Pr(X = x \mid \theta)$$
Examples:
Binomial distribution:
$$\Pr(X = x \mid n, p) = \binom{n}{x}\, p^{x} (1-p)^{n-x}, \quad x = 0, 1, \ldots, n$$
Typical application: The number of “successes” out of n independent trials, where for each trial the assigned probability of success is p.
Poisson distribution:
$$\Pr(X = x \mid \lambda) = \frac{\lambda^{x}}{x!}\, e^{-\lambda}, \quad x = 0, 1, 2, \ldots$$
Typical application: Count data, e.g. the number of times an event occurs in a fixed time period, where λ is the expected number of counts in that period.
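As a minimal sketch, the binomial and Poisson probability mass functions can be evaluated in R with the built-in functions dbinom() and dpois() and compared with the closed-form expressions. All parameter values below (n = 10, p = 0.3, lambda = 2.5, x = 3) are hypothetical and chosen only for illustration; they do not come from the slides.

```r
# Hypothetical illustration values (not taken from the source)
n <- 10        # number of trials
p <- 0.3       # assigned probability of success per trial
lambda <- 2.5  # expected number of counts in the period
x <- 3         # state of interest

# Binomial: built-in pmf versus the closed-form expression
binom_builtin <- dbinom(x, size = n, prob = p)
binom_formula <- choose(n, x) * p^x * (1 - p)^(n - x)

# Poisson: built-in pmf versus the closed-form expression
pois_builtin <- dpois(x, lambda = lambda)
pois_formula <- lambda^x * exp(-lambda) / factorial(x)

# "At most" and "at least" statements are sums of state probabilities
at_most  <- pbinom(x, size = n, prob = p)          # Pr(X <= x)
at_least <- 1 - pbinom(x - 1, size = n, prob = p)  # Pr(X >= x)
```

The cumulative function pbinom() adds up the state probabilities, which is what makes statements like “at most” and “at least” meaningful on this scale.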
b) Continuous states
(i) Interval scale
This scale is of the same kind as for discrete states.
Example: Daily temperature in degrees Celsius.
However, a probability > 0 cannot be assigned to a particular state. Instead, probabilities can be assigned to intervals of states.
The whole range of states has probability one.
The probability of an interval of states depends on the assigned probability density function for the range of states.
Denote the random variable by X. It is thus only meaningful to assign probabilities like Pr(a < X < b) [which is equal to Pr(a ≤ X ≤ b)].
Such probabilities are obtained by integrating the assigned density function (see further below).
(ii) Ratio scale
An interval scale with a well-defined zero state.
Example: Measurements of weight and length.
The probability measure is the same as for continuous interval-scale random variables.
The probability density function and probabilities:
The random variable, X, is almost always assumed to belong to a family of continuous probability distributions.
The density function is then specified in parametric form:
$$f(x \mid \theta)$$
and probabilities for intervals of states are computed as integrals:
$$\Pr(a < X < b \mid \theta) = \int_{a}^{b} f(x \mid \theta)\, dx$$
Examples:
1) Normal distribution, N(165, 6.4), a proxy for the height of a randomly selected adult woman.
E.g. Pr(150 < X < 160) is calculated as the area under the density curve between 150 and 160 (i.e. an integral).
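A possible way to evaluate this area in R is sketched below. The text does not say whether 6.4 is the standard deviation or the variance; the sketch assumes it is the standard deviation. The probability is computed both from the cumulative distribution function and by numerically integrating the density, to show that the two agree.

```r
mu    <- 165
sigma <- 6.4  # ASSUMPTION: 6.4 is read as the standard deviation

# Pr(150 < X < 160) as a difference of cumulative probabilities
p_interval <- pnorm(160, mean = mu, sd = sigma) -
  pnorm(150, mean = mu, sd = sigma)

# The same probability as the integral of the density from 150 to 160
p_integral <- integrate(dnorm, lower = 150, upper = 160,
                        mean = mu, sd = sigma)$value
```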
2) Gamma distribution [Gamma(k, θ)] with k = 2 and θ = 4 (might be a proxy for the time until death of an organism).
E.g. the probability that the time exceeds 0.5 (for the scaling used) is Pr(X > 0.5), which is the area under the density curve from 0.5 to infinity.
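The upper-tail area can be obtained in R with pgamma(). The text does not state whether the second parameter (4) is a rate or a scale, so the sketch below computes the probability under both readings; which one matches the original figure is an open assumption.

```r
k     <- 2  # shape parameter
theta <- 4  # second parameter; parametrisation not stated in the text

# Reading 1 (assumption): theta is a rate, so the mean is k / theta = 0.5
p_if_rate <- pgamma(0.5, shape = k, rate = theta, lower.tail = FALSE)

# Reading 2 (assumption): theta is a scale, so the mean is k * theta = 8
p_if_scale <- pgamma(0.5, shape = k, scale = theta, lower.tail = FALSE)
```

For shape 2 the tail area has the closed form exp(-r x)(1 + r x) with rate r, which gives a simple check of either result.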
Probability and Likelihood
Two synonyms?
An event can be likely or probable, which for most people would mean the same thing.
Yet, the definitions of probability and likelihood are different.
In a simplified form:
The probability of an event measures the degree of belief that this event is true, and is used for reasoning about not yet observed events.
The likelihood of an event is a measure of how likely that event is in light of another, observed event.
Both use probability calculus.
More formally…
Consider the unobserved event A and the observed event B.
There are probabilities for both, representing the degrees of belief for these events in general:
Pr(A | I) and Pr(B | I)
However, as B is observed we might be interested in
Pr(A | B, I)
which measures the updated degree of belief that A is true once we know that B holds. Still a probability, though.
How interesting is Pr(B | A, I)?
Pr(B | A, I) might look meaningless to consider, as we have actually observed B. However, it says something about A.
We have observed B, and if A is relevant for B we may compare Pr(B | A, I) with Pr(B | Ā, I).
Now, even if we have not observed A or Ā, one of them must be true, since Ā is the complement of A.
If Pr(B | A, I) > Pr(B | Ā, I) we may conclude that A is more likely to have occurred than Ā, or better phrased: “A is a better explanation of why B has occurred than is Ā”.
Pr(B | A, I) is called the likelihood of A given the observed B (and Pr(B | Ā, I) is the likelihood of Ā).
Note! This is different from the conditional probability of A given B: Pr(A | B, I).
Extending…
The likelihood represents what the observed event(s) or data say about A.
The probability represents what the model says about A (with or without conditioning on data).
The likelihood need not be a strict probability expression.
If the data consist of continuous measurements (interval or ratio scale), no distinct probability can be assigned to a specific value, but the specific value might be the event of interest.
Instead, the randomness of an event is measured through the probability density function f(x | A, I), where x (usually) stands for a specific value.
Example:
Suppose you have observed a person at quite a long distance and you try to determine the sex of this person. Your observations are the following:
1) The person was short
2) The person’s head was shaved
Based on observation 1 only, your provisional conclusion would be that it is a woman. This is so because women in general are shorter than men.
The likelihood for the event “It is a woman” is the density for women’s heights evaluated at the height of this person.
Based on observation 2 only, your provisional conclusion would be that it is a man. This is so because more men than women have shaved heads.
The likelihood here for the event “It is a woman” is the proportion of women that have shaved heads.
Note that, for observation 1, the likelihood is different from the proportion of women among those persons that have the same height as the person of interest.
Note that, for observation 2, the likelihood is different from the proportion of women among persons with shaved heads.
What if we combine observations 1 and 2?
Provided we can assume that a person’s height is not relevant for whether the person’s head is shaved or not, the likelihood for “It is a woman” in view of the combined observations is the product of the individual likelihoods.
This might lead to a combined likelihood that is equally large for both events (“It is a woman” and “It is a man”).
Note that it would be even more problematic to consider the proportion of women among those persons that have the same height as the person of interest and a shaved head.
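The product rule for the combined observations can be sketched in R as follows. Only the N(165, 6.4) proxy for women's heights comes from the slides (with 6.4 assumed to be the standard deviation). The observed height, the distribution of men's heights and both shaved-head proportions are hypothetical values invented for illustration.

```r
observed_height <- 160  # HYPOTHETICAL height of the observed person

# Observation 1: height density evaluated under each explanation
lik_height_woman <- dnorm(observed_height, mean = 165, sd = 6.4)
lik_height_man   <- dnorm(observed_height, mean = 179, sd = 7.0)  # HYPOTHETICAL

# Observation 2: proportion with shaved heads under each explanation
lik_shaved_woman <- 0.01  # HYPOTHETICAL
lik_shaved_man   <- 0.10  # HYPOTHETICAL

# Combined likelihoods: product of the individual likelihoods, assuming
# height is not relevant for whether the head is shaved
lik_woman <- lik_height_woman * lik_shaved_woman
lik_man   <- lik_height_man * lik_shaved_man

# Likelihood ratios (woman versus man) for each observation and combined
lr_height   <- lik_height_woman / lik_height_man
lr_shaved   <- lik_shaved_woman / lik_shaved_man
lr_combined <- lik_woman / lik_man
```

Under the irrelevance assumption the combined likelihood ratio is the product of the two individual ratios, so one observation pointing towards “woman” and one towards “man” can partly or fully cancel.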
The general definition of likelihood:
Assume we have a number of unobserved events A, B, … and some observed data.
The observed data can be one specific value (or state) of a variable, x, or a collection of values (states).
A probability model can be used that can either
assign a distinct probability to the observed data, Pr(x | I). This is the case when there is an enumerable set of possible values or when the observed data is a continuous interval of values
or
evaluate the density of the observed data, f(x | I). This is the case when x is a continuous variable.
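The likelihood of A given the data is
$$L(A \mid x) = \Pr(x \mid A, I) \quad \text{or} \quad L(A \mid x) = f(x \mid A, I)$$
The likelihood ratio of A versus B given the data is
$$LR = \frac{L(A \mid x)}{L(B \mid x)}$$
LR > 1 ⇒ A is a better explanation than B for the observed data.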
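Example:
Return to the example with detection of dye on bank notes.
The unobserved event is A = “Dye is present”.
The observed event (the data) is B = “Method gives positive result”.
The likelihoods of A and Ā (“Dye is absent”) are Pr(B | A, I) and Pr(B | Ā, I), and the likelihood ratio is LR = Pr(B | A, I) / Pr(B | Ā, I).
A positive result makes the event “Dye is present” a better explanation than the event “Dye is absent”, i.e. LR > 1.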
Potential danger in mixing things up:
When we say that an event is the more likely one in light of data, we do not say that this event has the highest probability.
Using the likelihood as a measure of how likely an event is, is a matter of inference to the best explanation.
Logic:
Implication: A ⇒ B
If A is true then B is true, i.e. Pr(B | A, I) = 1
If B is false then A is false, i.e. Pr(A | B̄, I) = 0
If B is true we cannot say anything about whether A is true or not (implication is different from equivalence).
“Probabilistic implication”:
If A is true then B may be true, i.e. Pr(B | A, I) > 0
If B is false then A may still be true, i.e. Pr(A | B̄, I) > 0
If B is true then we may decide which of A and Ā is the best explanation.
Inference to the best explanation:
B is observed.
A_1, A_2, …, A_m are potential alternative explanations of B.
If for each j ≠ k, Pr(B | A_k, I) > Pr(B | A_j, I), then A_k is considered the best explanation for B and is provisionally accepted.
Bayesian hypothesis testing
In an inferential setup we may work with propositions or hypotheses.
A hypothesis is a central component in the building of science, and forensic science is no exception.
Successive falsification of hypotheses (cf. Popper) is an important component of crime investigation.
The “standard situation” would be that we have two hypotheses:
H0: The forwarded hypothesis
H1: The alternative hypothesis
These must be mutually exclusive.
Classical statistical hypothesis testing (Neyman J. and Pearson E.S., 1933)
The two hypotheses are different explanations of the Data. Each hypothesis provides model(s) for Data.
The purpose is to use Data to try to falsify H0.
Type-I error: Falsifying a true H0
Type-II error: Not falsifying a false H0
Size or significance level: α = Pr(Type-I error)
If each hypothesis provides one and only one model for Data, the hypotheses are referred to as simple, and
Power: 1 − Pr(Type-II error) = 1 − β
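Most powerful test for simple hypotheses (Neyman-Pearson lemma):
Reject H0 when
$$\frac{L(H_0 \mid \text{Data})}{L(H_1 \mid \text{Data})} \le A$$
where A > 0 is chosen so that
$$\Pr\left( \frac{L(H_0 \mid \text{Data})}{L(H_1 \mid \text{Data})} \le A \,\middle|\, H_0 \right) = \alpha$$
This test minimises β for fixed α.
Note that the probability is taken with respect to Data, i.e. with respect to the probability model each hypothesis provides for Data.
Extension to composite hypotheses: Uniformly most powerful test (UMP)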
Example:
A seizure of pills, suspected to be Ecstasy, is sampled for the purpose of investigating whether the proportion of Ecstasy pills is “around” 80% or “around” 50%.
In a sample of 50 pills, 39 proved to be Ecstasy pills.
As the forwarded hypothesis we can formulate
H0: Around 80% of the pills in the seizure are Ecstasy
and as the alternative hypothesis
H1: Around 50% of the pills in the seizure are Ecstasy
The likelihoods of the two hypotheses are
L(H0 | Data) = Probability of obtaining 39 Ecstasy pills out of 50 sampled when the seizure proportion of Ecstasy pills is 80%
L(H1 | Data) = Probability of obtaining 39 Ecstasy pills out of 50 sampled when the seizure proportion of Ecstasy pills is 50%
Assuming a large seizure, these probabilities can be calculated using a binomial model Bin(50, p), where H0 states that p = p0 = 0.8 and H1 states that p = p1 = 0.5.
In generic form, if we have obtained x Ecstasy pills out of n sampled:
$$L(H_0 \mid \text{Data}) = \binom{n}{x}\, p_0^{x} (1-p_0)^{n-x}, \qquad L(H_1 \mid \text{Data}) = \binom{n}{x}\, p_1^{x} (1-p_1)^{n-x}$$
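The Neyman-Pearson lemma now states that the most powerful test is of the form
$$\frac{L(H_0 \mid \text{Data})}{L(H_1 \mid \text{Data})} = \left(\frac{p_0}{p_1}\right)^{x} \left(\frac{1-p_0}{1-p_1}\right)^{n-x} \le A$$
Since p0 > p1, the left-hand side increases with x, so the condition is equivalent to x ≤ B for some constant B.
Hence, H0 should be rejected in favour of H1 as soon as x ≤ B.
How to choose B?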
Normally, we would set the significance level α and then find B so that
$$\Pr(X \le B \mid H_0) \le \alpha$$
If α is chosen to be 0.05, we can search the binomial distribution valid under H0 for a value B such that
$$\sum_{x=0}^{B} \binom{50}{x}\, 0.8^{x}\, 0.2^{50-x} \le 0.05$$
MS Excel: BINOM.INV(50;0.8;0.05) returns the lowest value of B for which the sum is at least 0.05, which is 35.
BINOM.DIST(35;50;0.8;TRUE) is above 0.05, while BINOM.DIST(34;50;0.8;TRUE) is below 0.05.
Choose B = 34. Since x = 39 > 34, we cannot reject H0.
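The same search can be sketched in R, where qbinom() plays the role of BINOM.INV and pbinom() the role of BINOM.DIST with the cumulative flag set. The variable names are introduced here for illustration only.

```r
alpha <- 0.05  # significance level
n     <- 50    # sample size
p0    <- 0.8   # proportion stated by H0
x_obs <- 39    # observed number of Ecstasy pills

# Smallest value whose cumulative probability under H0 is at least alpha
b_inv <- qbinom(alpha, size = n, prob = p0)

# Cumulative probabilities at that value and at the value just below it
cdf_at_b_inv <- pbinom(b_inv, size = n, prob = p0)
cdf_below    <- pbinom(b_inv - 1, size = n, prob = p0)

# Critical value: largest B with Pr(X <= B | H0) <= alpha
B_crit <- if (cdf_at_b_inv <= alpha) b_inv else b_inv - 1

# Decision rule: reject H0 in favour of H1 when x_obs <= B_crit
reject_H0 <- x_obs <= B_crit
```

Because the binomial distribution is discrete, the attained significance level is in general strictly below α, which is why the value returned by the quantile function has to be checked against the cumulative probability.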
Drawbacks with the classical approach
“Isolated” falsification (or no falsification): tests using other data but with the same hypotheses cannot be easily combined.
Data alone “decides”. Small amounts of data ⇒ low power.
Difficulties in interpretation: When H0 is rejected at significance level α, it means “If we repeat the collection of data under (in principle) identical circumstances, then in (at most) 100α % of all cases a true H0 would be rejected”.
Can we (always) repeat the collection of data?
“Falling off the cliff”: What is the difference between “just rejecting” and “almost rejecting”?
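The Bayesian Approach
There is always a process that leads to the formulation of the hypotheses.
There exists a prior probability for each of them:
$$p_0 = \Pr(H_0 \mid I), \qquad p_1 = \Pr(H_1 \mid I)$$
More simply expressed as prior odds for the hypothesis H0:
$$\mathrm{Odds}(H_0 \mid I) = \frac{p_0}{p_1} = \frac{\Pr(H_0 \mid I)}{\Pr(H_1 \mid I)}$$
Non-informative priors: p0 = p1 = 0.5 gives prior odds = 1.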
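Data should help us calculate posterior odds:
$$\mathrm{Odds}(H_0 \mid \text{Data}, I) = \frac{q_0}{q_1} = \frac{\Pr(H_0 \mid \text{Data}, I)}{\Pr(H_1 \mid \text{Data}, I)}$$
The “hypothesis testing” is then a judgement upon whether q0 is
small enough to make us believe in H1, or
large enough to make us believe in H0,
i.e. no pre-setting of the decision direction is made.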
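The odds ratio (posterior odds / prior odds) is known as the Bayes factor:
$$B = \frac{\mathrm{Odds}(H_0 \mid \text{Data}, I)}{\mathrm{Odds}(H_0 \mid I)}$$
How can we obtain the posterior odds?
$$\mathrm{Odds}(H_0 \mid \text{Data}, I) = B \times \mathrm{Odds}(H_0 \mid I)$$
Hence, if we know the Bayes factor, we can calculate the posterior odds (since we can always set the prior odds).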
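1. Both hypotheses are simple, i.e. give one and only one model each for Data
a) Distinct probabilities can be assigned to Data
Bayes’ theorem on odds form then gives
$$\frac{\Pr(H_0 \mid \text{Data}, I)}{\Pr(H_1 \mid \text{Data}, I)} = \frac{\Pr(\text{Data} \mid H_0, I)}{\Pr(\text{Data} \mid H_1, I)} \times \frac{\Pr(H_0 \mid I)}{\Pr(H_1 \mid I)}$$
Hence, the Bayes factor is
$$B = \frac{\Pr(\text{Data} \mid H_0, I)}{\Pr(\text{Data} \mid H_1, I)}$$
The probabilities in the numerator and denominator can be calculated (estimated) using the model provided by the respective hypothesis.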
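b) Data is the observed value x of a continuous (possibly multidimensional) random variable
It can be shown that
$$\frac{\Pr(H_0 \mid x, I)}{\Pr(H_1 \mid x, I)} = \frac{f(x \mid H_0, I)}{f(x \mid H_1, I)} \times \frac{\Pr(H_0 \mid I)}{\Pr(H_1 \mid I)}$$
where f(x | H0, I) and f(x | H1, I) are the probability density functions given by the models specified by H0 and H1.
Hence, the Bayes factor is
$$B = \frac{f(x \mid H_0, I)}{f(x \mid H_1, I)}$$
Known (or estimated) density functions under each model can then be used to calculate the Bayes factor.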
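In both cases we can see that the Bayes factor is a likelihood ratio, since the numerator and denominator are the likelihoods of the respective hypotheses.
Example: Ecstasy pills revisited
The likelihoods for the hypotheses are
$$L(H_0 \mid \text{Data}) = \binom{50}{39}\, 0.8^{39}\, 0.2^{11}, \qquad L(H_1 \mid \text{Data}) = \binom{50}{39}\, 0.5^{39}\, 0.5^{11}$$
$$B = \frac{0.8^{39}\, 0.2^{11}}{0.5^{50}} \approx 3831$$
Hence, Data are 3831 times more probable if H0 is true compared to if H1 is true.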
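Assume we have no particular belief in either of the two hypotheses prior to obtaining the data:
$$\mathrm{Odds}(H_0 \mid I) = 1 \;\Rightarrow\; \mathrm{Odds}(H_0 \mid \text{Data}, I) = B \times 1 \approx 3831 \;\Rightarrow\; \Pr(H_0 \mid \text{Data}, I) = \frac{3831}{3831 + 1} \approx 0.9997$$
Hence, upon the analysis of data we can be 99.97% certain that H0 is true.
Note however that it may be unrealistic to assume only two possible proportions of Ecstasy pills in the seizure!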
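The calculation for the two simple hypotheses can be reproduced with a short R sketch. It assumes, as the slides do, that H0 and H1 are the only two possibilities, so that posterior odds convert to a probability as odds / (1 + odds). The variable names are chosen here for illustration.

```r
x <- 39  # Ecstasy pills found in the sample
n <- 50  # pills sampled

# Likelihoods under the two simple hypotheses (binomial model)
lik_H0 <- dbinom(x, size = n, prob = 0.8)
lik_H1 <- dbinom(x, size = n, prob = 0.5)

# Bayes factor = likelihood ratio
bayes_factor <- lik_H0 / lik_H1

# Even prior odds, then posterior odds and posterior probability of H0
prior_odds        <- 1
posterior_odds    <- bayes_factor * prior_odds
posterior_prob_H0 <- posterior_odds / (1 + posterior_odds)
```

Changing prior_odds shows directly how much the conclusion depends on the prior belief; the Bayes factor itself is unaffected.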
2. The hypothesis H0 is simple but the hypothesis H1 is composite, i.e. it provides several models for Data (several explanations)
The various models of H1 would (in general) provide different likelihoods for the different explanations ⇒ we cannot come up with one unique likelihood for H1.
If, in addition, the different explanations have different prior probabilities, we have to weight the different likelihoods by these.
If the composition in H1 is in the form of a set of discrete alternatives, the Bayes factor can be written
$$B = \frac{L(H_0 \mid \text{Data})}{\sum_{i} L(H_{1i} \mid \text{Data})\, P(H_{1i} \mid H_1)}$$
where P(H1i | H1) is the conditional prior probability that H1i is true given that H1 is true (relative prior), and the sum is over all alternatives H11, H12, …
If the relative priors are (fairly) equal, the denominator reduces to the average likelihood of the alternatives.
If the likelihoods of the alternatives are equal, the denominator reduces to that likelihood, since the relative priors sum to one.
If the composition is defined by a continuously valued parameter θ, we must use the conditional prior density of θ given that H1 is true, p(θ | H1), and integrate the likelihood with respect to that density. The Bayes factor can be written
$$B = \frac{L(H_0 \mid \text{Data})}{\int L(\theta \mid \text{Data})\, p(\theta \mid H_1)\, d\theta}$$
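A minimal R sketch of the discrete composite case follows, reusing the pill data (39 out of 50) with H0: p = 0.8. The set of alternatives under H1 (proportions 0.4, 0.5 and 0.6) and their equal relative priors are hypothetical and introduced only to illustrate the weighting; they are not part of the slides.

```r
x <- 39
n <- 50

# Simple forwarded hypothesis
lik_H0 <- dbinom(x, size = n, prob = 0.8)

# HYPOTHETICAL discrete alternatives under H1 and their relative priors
alt_props  <- c(0.4, 0.5, 0.6)
rel_priors <- c(1, 1, 1) / 3  # must sum to one

# Likelihood of each alternative, then the prior-weighted sum
lik_alts    <- dbinom(x, size = n, prob = alt_props)
denominator <- sum(lik_alts * rel_priors)

bayes_factor <- lik_H0 / denominator
```

With equal relative priors the denominator equals mean(lik_alts), which is the “average likelihood” mentioned above.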
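3. Both hypotheses are composite, i.e. each provides several models for Data (several explanations)
This gives different sub-cases, depending on whether the compositions in the hypotheses are discrete or according to a continuously valued parameter.
The “discrete-discrete” case gives the Bayes factor
$$B = \frac{\sum_{i} L(H_{0i} \mid \text{Data})\, P(H_{0i} \mid H_0)}{\sum_{j} L(H_{1j} \mid \text{Data})\, P(H_{1j} \mid H_1)}$$
and the “continuous-continuous” case gives the Bayes factor
$$B = \frac{\int L(\theta \mid \text{Data})\, p(\theta \mid H_0)\, d\theta}{\int L(\theta \mid \text{Data})\, p(\theta \mid H_1)\, d\theta}$$
where p(θ | H0) is the conditional prior density of θ given that H0 is true.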
Example: Ecstasy pills revisited again
Assume a more realistic case where we, from a sample of the seizure, shall investigate whether the proportion of Ecstasy pills is higher than 80%.
H0: Proportion > 0.8
H1: Proportion ≤ 0.8
i.e. both are composite.
We further assume, like before, that we have no particular belief in either of the two hypotheses. The prior density for the proportion θ can thus be defined as
$$p(\theta) = \begin{cases} 0.5 / 0.2 = 2.5, & 0.8 < \theta \le 1 \\ 0.5 / 0.8 = 0.625, & 0 \le \theta \le 0.8 \end{cases}$$
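The likelihood function is (irrespective of the hypotheses)
$$L(\theta \mid \text{Data}) = \binom{50}{39}\, \theta^{39} (1-\theta)^{11}$$
The conditional prior densities under each hypothesis become uniform over each interval of potential values of θ ((0.8, 1] and [0, 0.8]):
$$p(\theta \mid H_0) = \frac{1}{0.2}, \; 0.8 < \theta \le 1; \qquad p(\theta \mid H_1) = \frac{1}{0.8}, \; 0 \le \theta \le 0.8$$
The Bayes factor is
$$B = \frac{\int_{0.8}^{1} \theta^{39} (1-\theta)^{11}\, \frac{1}{0.2}\, d\theta}{\int_{0}^{0.8} \theta^{39} (1-\theta)^{11}\, \frac{1}{0.8}\, d\theta}$$
How do we solve these integrals?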
The Beta distribution:
A random variable is said to have a Beta distribution with parameters a and b if its probability density function is
$$f(x \mid a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\, x^{a-1} (1-x)^{b-1}, \quad 0 \le x \le 1$$
Hence, we can identify the integrals of the Bayes factor as proportional to different probabilities of the same beta distribution, namely a beta distribution with parameters a = 40 and b = 12.
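The identification can be written out as a short derivation. The symbol F, used here for the cumulative distribution function of the Beta(40, 12) distribution, is introduced only for this sketch. The normalising constant of the beta density (and the binomial coefficient of the likelihood) cancels in the ratio, whereas the uniform conditional prior densities 1/0.2 and 1/0.8 do not.

```latex
\begin{align*}
\int_{0.8}^{1} \theta^{39}(1-\theta)^{11}\,d\theta
  &= \frac{\Gamma(40)\,\Gamma(12)}{\Gamma(52)}\,\bigl[1 - F(0.8)\bigr] \\
\int_{0}^{0.8} \theta^{39}(1-\theta)^{11}\,d\theta
  &= \frac{\Gamma(40)\,\Gamma(12)}{\Gamma(52)}\,F(0.8) \\
B &= \frac{(1/0.2)\,\bigl[1 - F(0.8)\bigr]}{(1/0.8)\,F(0.8)}
   = 4 \cdot \frac{1 - F(0.8)}{F(0.8)}
\end{align*}
```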
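In R, the two beta probabilities are weighted by the uniform conditional prior densities 1/0.2 and 1/0.8:
> num <- 1 - pbeta(0.8, 40, 12)
> den <- pbeta(0.8, 40, 12)
> num
> den
> B <- (num / 0.2) / (den / 0.8)
> B
Hence, the Bayes factor is
$$B = \frac{1/0.2}{1/0.8} \times \frac{\text{num}}{\text{den}} = 4 \times \frac{\text{num}}{\text{den}}$$
With even prior odds (Odds(H0 | I) = 1) the posterior odds equal the Bayes factor, and the posterior probability of H0 is
$$\Pr(H_0 \mid \text{Data}, I) = \frac{B}{1 + B}$$
Data does not provide us with evidence clearly against either of the hypotheses.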