Parsimony, Likelihood, Common Causes, and Phylogenetic Inference


Parsimony, Likelihood, Common Causes, and Phylogenetic Inference. Elliott Sober, Philosophy Department, University of Wisconsin, Madison.

Pluralism about Ockham's razor? Two suggested uses: [Pre-test] O's razor should be used to constrain the order in which hypotheses are to be tested. [Post-test] O's razor should be used to interpret the acceptability/support of hypotheses that have already been tested. These are separate claims – both might be true. But they are related: the post-test razor has work to do only if testing fails to reject all, or all but one, of the hypotheses under consideration. For the rest of this talk I assume that this rejectionist picture of testing is wrong – the evidence doesn't refute all, or all but one, of the hypotheses on the table.

These can be compatible, but… If pre-test O's razor is "rejectionist," then post-test O's razor won't have a point; the contrapositive of this conditional will be relevant in the examples that follow. And if the pre-test idea involves testing hypotheses one at a time, then it views testing as noncontrastive.

Within the post-test category of support/plausibility: Bayesianism – compute posterior probabilities. Likelihoodism – compare likelihoods. Frequentist model selection criteria like AIC – estimate predictive accuracy.

I am a pluralist about these broad philosophies – not in the sense that each is okay as a global thesis about all scientific inference, but in the sense that each has its place.

Ockham’s Razors* Different uses of O’s razor have different justifications and some have none at all. * “Let’s Razor Ockham’s Razor,” in D. Knowles (ed.), Explanation and Its Limits, Cambridge University Press, 1990, 73-94.

Parsimony and Likelihood. In model selection criteria like AIC and BIC, likelihood and parsimony are conflicting desiderata: AIC(M) = log Pr(Data│L(M)) − k, where L(M) is the likeliest member of model M and k is the number of adjustable parameters in M. For nested models, increasing parsimony will almost always reduce likelihood.
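
A minimal sketch of the trade-off, assuming Gaussian errors with known variance and comparing nested polynomial models (the data and model choices here are illustrative, not from the talk):

```python
# Sketch: AIC(M) = log Pr(Data | L(M)) - k, where L(M) is the maximum
# likelihood fit of model M and k is its number of adjustable parameters
# (higher score is better, following the formula on the slide).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = 2.0 * x + rng.normal(0.0, 1.0, size=x.size)  # truth is linear

def aic(degree):
    """Fit a degree-`degree` polynomial by least squares (the ML fit
    under Gaussian errors with sigma = 1) and return log-likelihood - k."""
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    loglik = np.sum(-0.5 * resid**2 - 0.5 * np.log(2 * np.pi))
    k = degree + 1  # number of adjustable parameters
    return loglik - k

for d in (1, 2, 5):
    print(f"degree {d}: AIC score = {aic(d):.2f}")
# Richer (less parsimonious) models fit better, i.e. have higher
# log-likelihood, but pay a larger penalty k: the two desiderata conflict.
```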

Parsimony and Likelihood In model selection criteria like AIC and BIC, likelihood and parsimony are conflicting desiderata. In other settings, parsimony has a likelihood justification.

the Law of Likelihood: Observation O favors H1 over H2 iff Pr(O│H1) > Pr(O│H2). The term is from Ian Hacking, Logic of Statistical Inference, 1965. Likelihoodists don't use prior probabilities; they don't assess posterior probabilities, but just interpret the evidence at hand.
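
A toy illustration (my numbers, not Hacking's): let O be "eight heads in ten tosses," and let the competing hypotheses fix the coin's bias.

```python
# Law of Likelihood: O favors H1 over H2 iff Pr(O|H1) > Pr(O|H2).
# Toy example: O = 8 heads in 10 tosses; H1: bias 0.8, H2: bias 0.5.
from math import comb

def pr_heads(n, k, p):
    """Binomial probability of k heads in n tosses with bias p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

pr_h1 = pr_heads(10, 8, 0.8)   # ~0.302
pr_h2 = pr_heads(10, 8, 0.5)   # ~0.044
print(f"Pr(O|H1) = {pr_h1:.3f}, Pr(O|H2) = {pr_h2:.3f}")
print(f"likelihood ratio = {pr_h1 / pr_h2:.1f}  -> O favors H1 over H2")
# No priors appear anywhere: the comparison only interprets the evidence.
```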

a Reichenbachian idea: Salmon's example of plagiarism – two students hand in matching papers. [Diagram: Common Cause – a single cause C with effects E1 and E2; Separate Causes – C1 → E1 and C2 → E2.] One cause is more parsimonious than two causes.

Reichenbach's argument. IF (i) a cause screens off its effects from each other, (ii) all probabilities are non-extreme (≠ 0, 1), (iii) the CC and SC models are given a particular parameterization, and (iv) cause/effect relationships are "homogeneous" across branches, THEN Pr[Data│Common Cause] > Pr[Data│Separate Causes].

parameters and homogeneity. [Diagram: Common Cause – C with branch probabilities p1 to E1 and p2 to E2; Separate Causes – C1 → E1 with p1, C2 → E2 with p2.] Parameterization: Pr(C=i) = Pr(C1=i) = Pr(C2=i), for all i. And on branches, pi in CC is the same as pi in SC. Homogeneity concerns the relation of p1 to p2.

Given assumptions (i)–(iv), the more parsimonious hypothesis has the higher likelihood: parsimony and likelihood are ordinally equivalent.
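
A numeric check of the argument, with illustrative values for the shared parameters (the prior on the cause and the branch probabilities p1, p2 are my choices; the data are the matching effects):

```python
# Common Cause vs Separate Causes under Reichenbach's assumptions:
# same prior on the cause(s), same branch probabilities in both models,
# screening off, and nothing extreme. Data: both effects present.
PC = 0.5                      # Pr(C=1) = Pr(C1=1) = Pr(C2=1)
p = {1: 0.9, 0: 0.1}          # p1: Pr(E1=1 | cause in state i)
q = {1: 0.8, 0: 0.2}          # p2: Pr(E2=1 | cause in state i)
prior = {1: PC, 0: 1 - PC}

# Screening off: Pr(E1,E2 | CC) = sum_i Pr(C=i) Pr(E1|C=i) Pr(E2|C=i)
pr_cc = sum(prior[i] * p[i] * q[i] for i in (0, 1))

# Separate causes: the two effects are probabilistically independent
pr_sc = sum(prior[i] * p[i] for i in (0, 1)) * \
        sum(prior[j] * q[j] for j in (0, 1))

print(f"Pr(E1=1, E2=1 | Common Cause)    = {pr_cc:.3f}")   # 0.370
print(f"Pr(E1=1, E2=1 | Separate Causes) = {pr_sc:.3f}")   # 0.250
# The more parsimonious hypothesis has the higher likelihood.
```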

Some differences with Reichenbach I am comparing two hypotheses. I’m not using R’s Principle of the Common Cause. I take the evidence to be the matching of the students’ papers, not their “correlation.”

empirical foundations for likelihood ≈ parsimony. The four assumptions – screening off, non-extreme probabilities, the parameterization of the CC and SC models, and homogeneity across branches – are empirical. By adopting different assumptions, you can arrange for CC to be less likely than SC; then likelihood and parsimony conflict. Note: the Reichenbach argument shows that these assumptions are sufficient for likelihood ≈ parsimony, not that they are necessary.

Parsimony in Phylogenetic Inference. Two sources: Willi Hennig (Phylogenetic Systematics, 1950, 1966) and Luigi Cavalli-Sforza and Anthony Edwards. Two types of inference problem: find the best tree "topology," and estimate the character states of ancestors.

1. Which tree topology is better? [Two rooted trees with tips H, C, G: (HC)G groups H with C; H(CG) groups C with G.] MP: (HC)G is better supported than H(CG) by data D if and only if (HC)G is a more parsimonious explanation of D than H(CG) is.

An Example of a Parsimony Calculation. [Tip states H=1, C=1, G=0 on both trees; the root of each tree is in state 0.] (HC)G is more parsimonious here because it requires fewer changes in character state.
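
A small sketch of the count, with the root state fixed at 0 as in the figure, brute-forcing over the single unobserved interior node of each tree:

```python
# Minimum number of character-state changes on each rooted tree,
# with tips H=1, C=1, G=0 and the root fixed in state 0.
H, C, G, ROOT = 1, 1, 0, 0

def changes(edges):
    """Count branches whose two endpoints are in different states."""
    return sum(parent != child for parent, child in edges)

# (HC)G: root -> x, x -> H, x -> C, root -> G
score_hc_g = min(changes([(ROOT, x), (x, H), (x, C), (ROOT, G)])
                 for x in (0, 1))
# H(CG): root -> H, root -> y, y -> C, y -> G
score_h_cg = min(changes([(ROOT, H), (ROOT, y), (y, C), (y, G)])
                 for y in (0, 1))

print(f"(HC)G: {score_hc_g} change(s); H(CG): {score_h_cg} change(s)")
# -> (HC)G: 1 change(s); H(CG): 2 change(s)
```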

2. What is the best estimate of the character states of ancestors in an assumed tree? [Tips H=1, C=1, G=1 descend from ancestor A=?] MP says that the best estimate is that A=1.

Maximum Likelihood. [Same two trees: (HC)G and H(CG).] ML: (HC)G is better supported than H(CG) by data D if and only if PrM[D│(HC)G] > PrM[D│H(CG)]. ML is "model dependent": the subscript M marks the probabilistic model assumed. Biologists find the best-fitting tree + model, using maximum likelihood estimation for the parameters.

the present situation in evolutionary biology MP and ML sometimes disagree. The standard criticism of MP is that it assumes that evolution proceeds parsimoniously. The standard criticism of ML is that you need to choose a model of the evolutionary process.

When do parsimony and likelihood agree? (Ordinal Equivalence) For any data set D and any pair of phylogenetic hypotheses H1 and H2, PrM(D│H1) > PrM(D│H2) iff H1 is a more parsimonious explanation of D than H2 is. Whether likelihood agrees with parsimony depends on the probabilistic model M of evolution used. Felsenstein (1973) showed that the postulate of very low rates of evolution suffices for ordinal equivalence.
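
A numeric illustration of the low-rates point, under an assumed symmetric two-state model in which every branch has the same small change probability e and the root is fixed in state 0 (the same trees and data as above; this toy model is mine, not Felsenstein's more general 1973 model):

```python
# With a tiny per-branch change probability e, the likelihood ordering
# of the two rooted trees (tips H=1, C=1, G=0, root state 0) matches
# the parsimony ordering.
def pr_branch(a, b, e):
    """Symmetric two-state model: probability e of change, 1-e of stasis."""
    return e if a != b else 1 - e

def lik_hc_g(e):   # root -> x, x -> H(1), x -> C(1), root -> G(0)
    return sum(pr_branch(0, x, e) * pr_branch(x, 1, e) ** 2
               * pr_branch(0, 0, e) for x in (0, 1))

def lik_h_cg(e):   # root -> H(1), root -> y, y -> C(1), y -> G(0)
    return pr_branch(0, 1, e) * sum(pr_branch(0, y, e)
               * pr_branch(y, 1, e) * pr_branch(y, 0, e) for y in (0, 1))

for e in (0.01, 0.001):
    print(f"e={e}: Pr[D|(HC)G] = {lik_hc_g(e):.2e}, "
          f"Pr[D|H(CG)] = {lik_h_cg(e):.2e}")
# Pr[D|(HC)G] is of order e, Pr[D|H(CG)] of order e^2: the more
# parsimonious tree (one change vs two) has the higher likelihood.
```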

Does this mean that parsimony assumes that rates are low? NO: the assumptions of a method are the propositions that must be true if the method correctly judges support. Felsenstein showed that the postulate of low rates suffices for ordinal equivalence, not that it is necessary for ordinal equivalence.

Tuffley and Steel (1997) T&S showed that the postulate of “no-common-mechanism” also suffices for ordinal equivalence. “no-common-mechanism” means that each character on each branch is subject to its own drift process.

the two probability models of evolution. Felsenstein: rates of change are low, but not necessarily equal; drift is not assumed – Pr(i→j) and Pr(j→i) may differ. Tuffley and Steel: rates of change can be high; drift is assumed – Pr(i→j) = Pr(j→i).

How to use likelihood to define what it means for parsimony to assume something The assumptions of parsimony = the propositions that must be true if parsimony correctly judges support. For a likelihoodist, parsimony correctly judges support if and only if parsimony is ordinally equivalent with likelihood. Hence, for a likelihoodist, parsimony assumes any proposition that follows from ordinal equivalence.

A Test for what Parsimony does not assume: Model M ⇒ ordinal equivalence ⇒ A, where A = what parsimony assumes. If model M entails ordinal equivalence, and M entails proposition X, X may or may not be an assumption of parsimony. If model M entails ordinal equivalence, and M does not entail proposition X, then X is not an assumption of parsimony.

applications of the negative test T&S’s model does not entail that rates of change are low; hence parsimony does not assume that rates are low. F’s model does not assume neutral evolution; hence parsimony does not assume neutrality.

How to figure out what parsimony does assume? Find a model that forces parsimony and likelihood to disagree about some example. Then, if parsimony is right in what it says about the example, the model must be false.

Example #1. Task: Infer the character state of the MRCA of species that all exhibit the same state of a quantitative character. [Tips all = 10; ancestor A=?] The MP estimate is A=10. When is A=10 the ML estimate? And when is it not?

Answer: ML says that A=10 is the best estimate (and thus agrees with MP) if there is neutral evolution or selection is pushing each lineage towards a trait value of 10. ML says that A=10 is not the best estimate (and thus disagrees with MP) if (*) selection is pushing all lineages towards a single trait value different from 10. So: parsimony assumes, in this problem, that (*) is false.
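
A closed-form sketch under one illustrative model of selection (my choice, not specified in the talk): an Ornstein-Uhlenbeck process with optimum theta and strength alpha on each branch of a star tree, so that E[tip | A] = theta + (A − theta)·exp(−alpha·t). With every tip observed at 10 and Gaussian tip distributions whose variance does not depend on A, the ML estimate of A just matches that expectation to 10.

```python
# OU model: E[tip | ancestor A] = theta + (A - theta) * exp(-alpha * t).
# Setting this equal to the observed value 10 and solving for A gives
#   A_hat = theta + (10 - theta) * exp(alpha * t).
# Parameter values below are illustrative.
from math import exp

def ml_ancestor(obs, theta, alpha, t):
    return theta + (obs - theta) * exp(alpha * t)

print(ml_ancestor(10, theta=10, alpha=0.5, t=1))  # optimum at 10 -> A_hat = 10
print(ml_ancestor(10, theta=10, alpha=0.0, t=1))  # drift (alpha=0) -> A_hat = 10
print(ml_ancestor(10, theta=0,  alpha=0.5, t=1))  # optimum at 0 -> A_hat ~ 16.5
# When selection pushes all lineages towards a value other than 10, the
# ML estimate of the ancestor is not 10, and likelihood disagrees with MP.
```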

Example #2. Task: Infer the character state of the MRCA of two species that exhibit different states of a dichotomous character. [Tips: 1 and 0; ancestor A=?] A=0 and A=1 are equally parsimonious. When are they equally likely? And when are they unequally likely?

Answer: ML agrees with MP that A=0 and A=1 are equally good estimates if the same neutral process occurs in the two lineages. ML disagrees with MP if (*) the same selection process occurs in both lineages. So: parsimony assumes, in this problem, that (*) is false. If selection favors state 1, what is the ML estimate of A? It is A=0.
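
A numeric check, with illustrative per-branch transition probabilities (drift: symmetric change probabilities; selection for state 1: change into state 1 more probable than change out of it):

```python
# Data: one descendant in state 1, the other in state 0; the same process
# runs in both lineages, so Pr(data | A) = Pr(A -> 1) * Pr(A -> 0).
def pr_data(A, p01, p10):
    """p01 = Pr(0 -> 1) along a branch, p10 = Pr(1 -> 0)."""
    to1 = p01 if A == 0 else 1 - p10
    to0 = 1 - p01 if A == 0 else p10
    return to1 * to0

# Drift: symmetric change probabilities.
print([pr_data(A, 0.3, 0.3) for A in (0, 1)])  # [0.21, 0.21] -> tie, as MP says

# Selection favoring state 1: 0 -> 1 likely, 1 -> 0 unlikely.
print([pr_data(A, 0.8, 0.1) for A in (0, 1)])  # [0.16, 0.09] -> A=0 wins
# With selection for state 1, the lineage ending in state 0 is the hard
# one to explain, and an ancestor in state 0 explains it best: ML says A=0.
```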

Conclusions about phylogenetic parsimony ≈ likelihood: The assumptions of parsimony are the propositions that must be true if parsimony correctly judges support. To find out what parsimony does not assume, use the test described [M ⇒ ordinal equivalence ⇒ A]: if M entails ordinal equivalence and M does not entail X, then X is not an assumption of parsimony. To find out what parsimony does assume, look for examples in which parsimony and likelihood disagree, not for models that ensure that they agree. Maybe parsimony's assumptions vary from problem to problem.

broader conclusions. Underdetermination: O's razor often comes up when the data don't settle truth/falsehood or acceptance/rejection. Reductionism: when O's razor has authority, it does so because it reflects some other, more fundamental, desideratum (but there isn't a single global justification). Two questions: when parsimony has a precise meaning, we can investigate what its presuppositions are and what suffices to justify it.

A curiosity: in the Reichenbach argument, to get a difference in likelihood, the hypotheses should not specify the states of the causes. [Same Common Cause / Separate Causes diagram as before, with branch probabilities p1, p2.] This is an oddity of this argument for likelihood ≈ parsimony.

Example #0. Task: Infer the character state of the MRCA of species that all exhibit the same state of a dichotomous character. [Tips all = 1; ancestor A=?] The MP inference is that A=1. When is A=1 the ML inference? Answer: when lineages have finite duration and the process is Markovian. It doesn't matter whether selection or drift is the process at work.
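
A sketch of why, using the standard closed form for a two-state continuous-time Markov chain (the rates a, b below are illustrative): over any finite time t, Pr(1 → 1) exceeds Pr(0 → 1) by exactly exp(−(a+b)t) > 0, whatever the rates, so tips that are all in state 1 make A=1 the likelier ancestor under drift and under selection alike.

```python
# Two-state chain with rate a for 0 -> 1 and rate b for 1 -> 0. Over a
# branch of finite length t: P11(t) - P01(t) = exp(-(a+b)*t) > 0, so for
# n tips all in state 1, Pr(data|A=1) = P11^n > P01^n = Pr(data|A=0).
from math import exp

def p01(a, b, t):
    return a / (a + b) * (1 - exp(-(a + b) * t))

def p11(a, b, t):
    return (a + b * exp(-(a + b) * t)) / (a + b)

for a, b in [(1.0, 1.0), (5.0, 0.5)]:        # drift, then selection for 1
    for t in (0.1, 1.0, 10.0):
        print(f"a={a}, b={b}, t={t}: "
              f"P11={p11(a, b, t):.4f} > P01={p01(a, b, t):.4f}")
# The gap exp(-(a+b)t) shrinks as t grows but never vanishes for finite t,
# which is why the lineages must have finite duration.
```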