Evidentiary strength of a rare haplotype match: What is the right number? Charles Brenner, PhD DNA·VIEW and UC Berkeley Public Health www.dna-view.com.


[FP] Brenner CH (2010) Fundamental problem of forensic mathematics – The evidential value of a rare haplotype. Forensic Sci. Int. Genet –291

The problem
Crime scene Y-haplotype. Call it S. Imagine a suspect matches. How strong is the evidence that he is the donor?
– In particular, suppose S is previously unobserved in the reference database.
– When we lose our familiar crutch of sample frequency as an estimator for population frequency, what can we use instead?

Mathematical formulation of “Evidential value of a match”
Where do we start?
– B Weir: Likelihood Ratio
– Simplest problem first
LR = 1 / Pr(match crime scene Y haplotype S | random suspect)
Problem is then to evaluate the denominator probability.
– Think prospectively: Given the crime scene type S, how surprised will I be if a random (i.e. innocent) man matches?

Suspect matches crime scene haplotype. Relevant number? Relevant number is the matching probability, the probability that a random suspect would match the crime scene type given available data of crime scene type & population database and general scientific knowledge. Innocent suspect is the test. Probability is the issue. Data means information that we have. Is there another kind?

General scientific knowledge
Some version (simplified/selective) of “scientific knowledge” constitutes a model of reality.
Matching probability can be derived given, and only given, an adequate model.
– Model must be valid (close enough to reality)
Models I have considered include:
– (1998) “Infinite alleles” → β prior (Ewens '72). Couldn’t validate satisfactorily.
– (1998) Ω_t (many equally rare alleles). Couldn’t be sure it’s not anti-conservative.
– (2008 & today) “Equal over-representation” κ method

General scientific knowledge
[Figure: Growth of a (Y-)haplotype “database” (population sample). Kappa = proportion of singletons; κ=0.9]

Y-STR efficacy
random match probability ≈ 1/ (N≈1000)
eliminates all false leads (e.g. familial searching)
Y-haplotype matching odds for US populations (Yfiler):
    US Caucasian    1/8900
    US Black        1/14000
    US Asian        1/4100
Note: If n<5000, a “confidence interval” (e.g. 1.65/n) proponent is in the absurd position of suggesting that the matching probability to a new type is significantly less than the above.
The empirical match probability from the database per above is practically the whole story.
– If sample frequency of S is unknown (someone lost the database), it is the whole story.
– If S is a new type, can refine them down a little. ☜
– Otherwise (infrequent occurrence), match probability is larger.

Y-filer population sample data
size = # of chromosomes
α = # of singletons (types not repeated)
κ = α/size, proportion of sample that is singleton
    Population    Size    α    κ=α/n    1/(1−κ) (“inflation factor”)
    US Black
    Asian
    Caucasian
    Example D     n−1     α    0.9      10

Quiz: Probability of new type?
Assume the Example Y-haplotype database. κ=90% of the chromosomes are singletons.
– Assume κ changes only slowly as D grows.
What is the probability that the next person sampled has a NEW type?
Answer: κ (90%), the same as the probability the last one added was new. (H. Robbins, Ann Math Stat 1968)
Corollary: κ of the population is not represented in the database.
Corollary: 1−κ (e.g. 10%) = probability new observation (i.e. crime scene type) IS represented in the database.
– Equivalently: For any type in the database, sample frequency typically over-represents population frequency by 1/(1−κ).
Modeling assumption: especially for the singletons!
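
The quiz answer can be checked numerically. The following is only a minimal sketch under stated assumptions: the population is hypothetical (type k has weight proportional to 1/(k+1), chosen purely for illustration and not taken from the talk), and the function name and parameters are invented here. It draws many samples of size n, records the singleton proportion κ of each, and counts how often one further draw is a type not yet in the sample.

```python
import random
from collections import Counter
from itertools import accumulate


def simulate_next_is_new(num_types=5000, n=500, trials=2000, seed=1):
    """Return (mean kappa of the samples, observed Pr(next draw is a new type)).

    Assumption: a hypothetical heavy-tailed population in which type k has
    weight proportional to 1/(k+1). This is illustrative, not real Y-STR data.
    """
    rng = random.Random(seed)
    cum_weights = list(accumulate(1.0 / (k + 1) for k in range(num_types)))
    types = range(num_types)
    kappa_sum = 0.0
    new_count = 0
    for _ in range(trials):
        draws = rng.choices(types, cum_weights=cum_weights, k=n + 1)
        sample, next_type = draws[:n], draws[n]
        counts = Counter(sample)
        singletons = sum(1 for c in counts.values() if c == 1)
        kappa_sum += singletons / n          # kappa = alpha / size
        if next_type not in counts:          # next person has a NEW type
            new_count += 1
    return kappa_sum / trials, new_count / trials


kappa, p_new = simulate_next_is_new()
print(f"mean kappa = {kappa:.3f}, Pr(next is new) = {p_new:.3f}")
```

Under this toy population the two printed numbers should be close to each other, which is the content of the quiz answer; the particular value of κ depends entirely on the assumed population.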

Pr(match) – analysis
Construct the ExtendedDatabase of size n by including the crime stain S (condition on S).
– ExtendedDatabase has α ≈ κn singletons: S=S_0, S_1, S_2, S_3, …, S_(α−1)
Innocent suspect arrested, with haplotype T. We want Pr(match) = Pr(T=S).
– Modeling assumption: No information from type.
– Same as Pr(T=S_i) for any i. (Same information/evidence, so same probability.) Same unrelatedness to innocent suspect.
Obtain in 3 steps.

Pr(match) – 3 part calculation
Assume T is type of innocent suspect.
    A: T is in ExtendedDatabase                                    Pr(A) = 1−κ
    B: T = S_i for some singleton S_i in the ExtendedDatabase      Pr(B|A) ≤ κ
    C: T = S (= S_0)                                               Pr(C|B&A) = 1/(nκ)
Pr(C) = Pr(C&B&A) = Pr(C|B&A)·Pr(B|A)·Pr(A) ≤ (1/(nκ))·κ·(1−κ) = (1−κ)/n.
[Figure: reference sample D of n types, divided into non-singletons and singletons, with S marked; label 1/n]

So …
Pr(T=S) ≈ (1−κ)/n
Imagine κ=90%. Then Pr(T=S) ≈ 1/(10n).
LR = 1/Pr(T=S) ≈ 10n is the odds against a random match, the strength of evidence against a matching suspect.
1/(1−κ), equal to 10 in this example, is the inflation factor: the factor by which the matching LR exceeds the simple counting rule estimate.
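
The arithmetic of the three-step product and the resulting LR can be written out as a short sketch. The function names are hypothetical, and the value n=1000 in the usage line is an illustrative assumption; only κ=0.9 comes from the slides.

```python
def match_probability_bound(n, kappa):
    """Upper bound on Pr(T=S) from the three conditional steps A, B, C."""
    if not 0.0 < kappa < 1.0:
        raise ValueError("kappa must lie strictly between 0 and 1")
    pr_a = 1.0 - kappa                 # T is in the ExtendedDatabase
    pr_b_given_a = kappa               # T is one of the singletons (upper bound)
    pr_c_given_ab = 1.0 / (n * kappa)  # T is the particular singleton S
    return pr_c_given_ab * pr_b_given_a * pr_a  # equals (1 - kappa) / n


def kappa_lr(n, kappa):
    """Likelihood ratio for a new type: LR = n / (1 - kappa)."""
    return 1.0 / match_probability_bound(n, kappa)


# Illustrative values: n = 1000 is an assumption, kappa = 0.9 as on the slide.
print(match_probability_bound(1000, 0.9))  # about 1/(10n) = 1e-4
print(kappa_lr(1000, 0.9))                 # about 10n = 10000
```

Here n is the size of the ExtendedDatabase, that is, the reference database plus the crime scene type.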

Not so fast! Check assumptions.
1. Model assumption #1: No information from type.
2. My derivation that Pr(T=S) ≤ (1−κ)/n relies on a subtle modeling assumption:
– The singletons in the database over-represent their population proportion by (at least) as much as the non-singletons do.
Checking: extensive population simulations.

Validation of the “κ method”
Valid: LR_κ ≈ 1 / E(freq(S) | S is singleton)
(Expectation is taken over all singleton observations.)
[Figure: κ model relative error (3% marked) against sample size, for varying population size, mutation rate, and population growth/generation]
27 simulated model populations span the realistic range of size, growth, mutation rate.
For each sample size n=300, 1000, …, many samples drawn.
All singletons’ population frequencies were compared with the κ formula. Looks ok.

(forensic) mathematical exposition
Features / paradigm
– State problem
– Formulate it mathematically
– State premises
– What is the model?
– Justify the premises
– Validate the model.
– Test=innocent suspect
– Derive the result
Benefits
– Communicate
– Explain accurately
– Logical; persuasive
– Linear deductive organization
Facilitate discussion/argument
– Where do we disagree?
– Premises? Reasoning step?
– Resolution

Rare haplotype matching probability
Features / paradigm
– State problem
– Formulate it mathematically
– State premises
– What is the model?
– Justify the premises
– Validate the model.
– Test=innocent suspect
– Derive the result
Brenner paper [FP]
– Evidential value of match?
– Pr(innocent suspect matches | crime scene, database)
– Type is (mostly) just a name
– “equal over-representation”
– Validation by tediously simulating/examining suitable range of populations
– LR=n/(1−κ) (for new type)
– n=reference database size+1
– κ=singleton proportion

criticisms
[BKW] claim: κ method’s “type=arbitrary name” approach ignores “substantial information” from the repeat lengths.
– My approach can be extended to include whatever information. I merely began with the simplest model.
– “Substantial information” sounds confident. It’s a plausible guess but from my research it is wrong.
– κ method, uniquely, has been shown to be valid.
* [BKW] J.S. Buckleton, M. Krawczak, B.S. Weir, The interpretation of lineage markers in forensic DNA testing, FSI Genetics (2011) 5, 78-83

“we have shown …” – where?
[BKW]: “as we have shown, Brenner’s approach … suffers from potential anti-conservativeness in the way it inherently estimates haplotype frequencies.”
– (Hey! It’s “matching probability”, not “haplotype frequency”!)
Shown where? Three possible answers:
1. Dead-end attempt at analysis
2. Invalid counterexample
3. Algebraic blunder
Conclusion: Nothing “shown.”

1. Dead-end criticism
[BKW] pursue a hopeful line of analysis, constructing an alternative expression for the value of my formula …
… and get stuck: it “is a complex function...difficult to judge … if, and to what extent … ”
Too bad the line of analysis didn’t pan out. (Lots of mine don’t either.)
– Why imagine a dead end is evidence that the κ method is wrong (or right)?
– Why publish something pointless?

2. Invalid counterexample
1. In [FP] I construct an artificial population Ω_t (many exactly equally rare types; e.g. t=1000 types) where my method would not work.
2. Reason: to explain that
   A. κ method doesn’t claim to be a mathematical identity
   B. but rather depends on evolutionary mechanisms, on reality,
   C. hence the example motivates the need for validation.
3. The validation shows that the method works in reality.
[BKW] cites my example as a counterexample to my method!
– Misunderstands 2 & overlooks 3.
– In particular [BKW] says the opposite of 2A.
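
Why Ω_t defeats the formula can be seen with a small calculation. This is a sketch under assumptions that are not in the slides: every one of the t types has population frequency exactly 1/t, the sample κ is replaced by its expected value, and the function names are hypothetical. In a sample of n draws, a given observation is a singleton when the other n−1 draws all differ from it, so the expected κ is (1−1/t)^(n−1).

```python
def expected_kappa_equal_types(t, n):
    """Expected singleton proportion when n draws come from t equally likely types."""
    if t < 2 or n < 2:
        raise ValueError("need at least two types and two draws")
    return (1.0 - 1.0 / t) ** (n - 1)


def compare_lr(t, n):
    """Return (kappa-method LR using the expected kappa, true LR = t)."""
    kappa = expected_kappa_equal_types(t, n)
    lr_kappa = n / (1.0 - kappa)
    lr_true = float(t)  # every type has population frequency 1/t
    return lr_kappa, lr_true


for n in (100, 1000, 5000):
    lr_kappa, lr_true = compare_lr(1000, n)
    print(n, round(lr_kappa), round(lr_true))
```

In this artificial population the formula's LR comes out above the true value of t, and the gap widens as n grows past t. That is consistent with the slide's point that the method is not a mathematical identity and needs validation against realistic populations.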

3. Criticism by mistake
Notation: Sample of n haplotypes. p = probability particular type = A
Easy algebra:
– Pr(particular type ≠ A) = 1−p
– Pr(0 = # observations of A in sample) = (1−p)^n
– Pr(0 < # observations of A in sample) = 1−(1−p)^n
[BKW]: Pr(1 < # observations of A in sample) = 1−(1−p)^n
– If true that would (in the context) imply that the κ method has a counter-intuitive consequence. Pointless (since counter-intuitive ≠ wrong) if so.
– But since 1≠0, it’s not even true.
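
The difference between "more than 0 observations" and "more than 1 observation" can be checked with the binomial distribution. A minimal sketch; the function name and the example values n=1000, p=0.001 are illustrative assumptions.

```python
from math import comb


def pr_more_than(k, n, p):
    """Pr(# observations of A in a sample of n exceeds k), binomial model."""
    at_most_k = sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1))
    return 1.0 - at_most_k


n, p = 1000, 0.001                 # illustrative values only
print(pr_more_than(0, n, p))       # equals 1 - (1-p)^n
print(1 - (1 - p) ** n)
print(pr_more_than(1, n, p))       # smaller: also subtracts n*p*(1-p)^(n-1)
```

The first two printed values agree, while the third is strictly smaller, so 1−(1−p)^n is the probability of at least one observation, not of more than one.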

Assessment of validity
Result: LR_κ = n/(1−κ) is a reasonable assessment of the evidence that a suspect with a matching haplotype is the donor when the crime scene haplotype is unseen in a database.
The paper [FP] derives and validates the formula in a coherent, linear deductive presentation, the appropriate framework for discussion including criticism.
– Known criticisms make no sense.
– Better to assess the logic of the paper & see if and exactly where there is a flaw or disagreement.

Final comments
1. Test is the innocent suspect, e.g. probability that a random suspect would match the crime scene type
2. (Matching) probability is not (haplotype) frequency (inference from data; no confidence intervals)
3. Condition on the crime scene type (toss into database. No more “0 count”.)
4. Sample frequency may not approximate probability
LR can be >> sample size
LR = 1/Pr(T=S)
κ method: LR ≈ n/(1−κ) for a new type.

The end The rules of genetics are simple. Their consequences are not always obvious. This work received no support from the NIJ, IMF, World Bank, Bill and Melinda Gates, or the Ford Foundation. Even Queen Isabella, traditionally a soft touch, didn’t pitch in.

Understanding Y haplotypes
1. Evolutionary history and population genetics
2. Evidential value

All men alive today have a common Y-chromosome ancestor
(probably 3,000 generations ago)

Two men have the same Yfiler haplotype.
Connected to a common ancestor without mutation (IBD), or not?
(Terminology:
◦ IBD = Identity by descent = related with no intervening mutations
◦ IBS = Identity by state = same haplotype, maybe coincidentally)

Y-haplotype lineage
[Figure: lineage tree descending from “Adam” along “Time’s winged chariot”, marking mutations and a convergent mutation (rare). Same color = same Y-haplotype]

Convergent Y mutation
Y haplotype = 17 numbers = position in 17-space
Mutation is random walk in 17 dimensions
– Each step is +1 or −1 in some dimension. 2 × 17 = 34
Random walks rarely return to start.
– 2 mutation separation: 1/34 chance that 2nd mutation reverses 1st one.
– Probability to converge otherwise is negligible.
Identical Y-filer haplotype => relationship to common ancestor without mutations (IBD)
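
The 1/34 figure can be reproduced by enumerating every ordered pair of single-step mutations. A minimal sketch that assumes all 34 possible steps (17 loci, ±1 each) are equally likely, which is a simplification; the function name is hypothetical.

```python
from fractions import Fraction
from itertools import product


def two_step_return_probability(dims=17):
    """Exact chance that a second single-step mutation undoes the first.

    Assumption: each of the 2*dims steps (locus, +1 or -1) is equally likely.
    """
    steps = [(locus, sign) for locus in range(dims) for sign in (+1, -1)]
    returns = sum(
        1
        for (l1, s1), (l2, s2) in product(steps, repeat=2)
        if l1 == l2 and s1 == -s2      # same locus, opposite direction
    )
    return Fraction(returns, len(steps) ** 2)


print(two_step_return_probability())   # 1/34
```

Real loci mutate at different rates, so the true reversal chance would differ somewhat from this equal-rate idealization.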

Convergence experiment
Simulated Y-filer population (N=90000)
Small proportion of pair-wise matches
– Pr(match) = 1/9000
Given match (IBS), are all IBD?
– Pr(IBD | IBS) = 33/34 (experimental, from simulation)
– Close to computed estimate of non-convergence (previous slide). (Why? They are not the same experiment.)

Time to diverge
μ ≈ 1/350 per locus per generation (1/150-1/3000)
μ ≈ 5% per generation (17 loci)
Suppose 4 generations / century
– Common ancestor a century ago = 3rd cousins
– 8 meioses per century of separation between two contemporary men
Pr(Y’s equal after 1 century) ≈ 70%
Expected # differences = 4/millennium.
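
The figures on this slide follow from a few lines of arithmetic. A minimal sketch; the function names are hypothetical, the per-haplotype rate is approximated as 17 × μ, and "expected differences" counts mutation events without correcting for reversals.

```python
def haplotype_mutation_rate(mu_locus=1 / 350, loci=17):
    """Approximate chance per generation that at least one locus mutates."""
    return loci * mu_locus


def pr_still_equal(meioses, mu_haplotype):
    """Chance that no mutation occurs across the given number of meioses."""
    return (1.0 - mu_haplotype) ** meioses


def expected_differences(meioses, mu_haplotype):
    """Expected number of mutation events separating two men."""
    return meioses * mu_haplotype


mu = haplotype_mutation_rate()           # about 0.049, rounded to 5% on the slide
print(mu)
print(pr_still_equal(8, 0.05))           # one century apart, 8 meioses
print(expected_differences(80, 0.05))    # one millennium apart, 80 meioses
```

With the rounded 5% rate the one-century value comes out a little under the slide's round figure of 70%, and the one-millennium expectation is 4 mutations.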

Y-haplotype divergence Expected # differences  virtual non overlap of races

Comments on crime-suspect match
If the suspect is not the donor, then there is a 97% chance (33/34) that suspect and donor are IBD, which is very unlikely if they are separated by more than a few centuries.
A 17-locus haplotype is typically represented by a small number* of men descended without mutation from a common ancestor 300 years ago.
Probably way beyond immediate family.
* 1/10000 of the population

Example: 1272 Caucasian men (ABI)
◦ pairwise comparisons (big sample!)
90% of 1272 men are singletons (no pairwise matches)
49 pairs of matching haplotypes (49 matches)
5 triples (5×3=15 pairwise matches)
◦ … in total 91 pairwise matches /
◦ Pairwise matching rate 1/8900
Can evidential strength (new type) be less than that? (no matter what the “upper confidence” limit may be)
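
The pairwise matching rate can be recomputed from the counts on the slide. A minimal sketch; the function name is hypothetical, and the sizes of the groups larger than triples are not given on the slide, so the total of 91 matches is taken from the slide rather than derived.

```python
from math import comb


def pairwise_matches(group_sizes):
    """Matching pairs contributed by groups of men sharing a haplotype."""
    return sum(comb(size, 2) for size in group_sizes)


men = 1272
comparisons = comb(men, 2)                    # all unordered pairs of men
pairs_and_triples = pairwise_matches([2] * 49 + [3] * 5)
total_matches = 91                            # total stated on the slide

print(comparisons)
print(pairs_and_triples)                      # 49 + 15 = 64 of the 91 matches
print(comparisons / total_matches)            # roughly 8900, i.e. rate 1/8900
```

The remaining 27 matches would come from the larger groups that the slide abbreviates with “…”.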

Assume Y-filer (17 STR loci)
Probability in an actual database?
◦ Example: 1272 Caucasian men (ABI sample)
◦ 90% are “singletons”
Smaller database
◦ If n=1, 100% singletons
Suppose we collect the entire world male population. What % of singletons?