Goals:
- To identify subpopulations (subsets of the sample with distinct allele frequencies).
- To assign individuals (probabilistically) to subpopulations.

Ancestry models
- No admixture: each individual is derived completely from a single subpopulation.
- Admixture: individuals may have mixed ancestry; some fraction q_k of the genome of individual i is derived from subpopulation k.

Typical data:

Individual   Locus 1   Locus 2   Locus 3   Locus 4
1            A,A       A,A       A,C       A,A
2            A,B       A,A       A,B       A,A
3            B,B       A,B       A,A       A,A
4            C,C       D,E       D,E       B,C
5            C,C       C,D       D,D       B,D
6            B,C       E,E       A,E       C,E
7            A,C       D,D       C,D       A,D

{A,B,C,D,E} are labels for the different alleles at each locus.

More on the model...
Let P_1, P_2, ..., P_K represent the (unknown) allele frequencies in each subpopulation.
Let Z_1, Z_2, ..., Z_m represent the (unknown) subpopulation of origin of the m sampled individuals (no-admixture model).
Let X_ijk be the genotype data for allele copy k (k = 1, 2) of individual i at locus j.
Assuming Hardy-Weinberg and linkage equilibrium within subpopulations, the likelihood of individual i's genotype G_i in subpopulation k is the product of the relevant allele frequencies:

  Pr(G_i | Z_i = k, P_k) = ∏_{loci j} p^(k)_{X_ij1} · p^(k)_{X_ij2}

where p^(k)_x denotes the frequency of allele x (at the locus in question) in subpopulation k.
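The product of allele frequencies above can be computed directly. Below is a minimal sketch in Python (working in log space to avoid underflow over many loci); the function name `genotype_log_likelihood` and the toy frequencies are hypothetical, not part of the original slides.

```python
import math

def genotype_log_likelihood(genotype, freqs):
    """Log-likelihood of one individual's genotype in one subpopulation,
    assuming Hardy-Weinberg and linkage equilibrium: sum over loci of the
    log-frequencies of the two observed allele copies.

    genotype: list of (allele1, allele2) pairs, one per locus
    freqs:    list of dicts, one per locus, mapping allele label -> frequency
              in the subpopulation
    """
    logL = 0.0
    for (a1, a2), p in zip(genotype, freqs):
        logL += math.log(p[a1]) + math.log(p[a2])
    return logL

# Individual 1 from the table above: (A,A), (A,A), (A,C), (A,A)
g = [("A", "A"), ("A", "A"), ("A", "C"), ("A", "A")]
f = [{"A": 0.5, "B": 0.3, "C": 0.2}] * 4   # made-up frequencies for one subpopulation
print(genotype_log_likelihood(g, f))        # 7*log(0.5) + log(0.2) ≈ -6.46
```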

Then, adopting a Bayesian framework, we can write down the probability that individual i is from subpopulation k:

  Pr(Z_i = k | G_i, P) = Pr(G_i | Z_i = k, P_k) Pr(Z_i = k) / Σ_{pops j} Pr(G_i | Z_i = j, P_j) Pr(Z_i = j)

Here, Pr(Z_i = k) gives the prior probability that individual i is from subpopulation k. Assigning the individual at random to a population according to these probabilities is an example of Gibbs sampling.
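The Gibbs update described above (normalize likelihood × prior across subpopulations, then draw an assignment) can be sketched as follows; the function name `sample_assignment` is a hypothetical illustration, and the max-subtraction trick is a standard numerical-stability device not mentioned on the slide.

```python
import math
import random

def sample_assignment(log_liks, priors, rng=random):
    """One Gibbs update for Z_i: form Pr(Z_i = k | G_i, P) proportional to
    Pr(G_i | Z_i = k, P_k) * Pr(Z_i = k), normalize, and draw k.

    log_liks: log Pr(G_i | Z_i = k, P_k) for each subpopulation k
    priors:   prior Pr(Z_i = k) for each k
    Returns (sampled k, posterior probabilities).
    """
    m = max(log_liks)  # subtract the max before exponentiating, for stability
    weights = [math.exp(l - m) * q for l, q in zip(log_liks, priors)]
    total = sum(weights)
    probs = [w / total for w in weights]
    # inverse-CDF draw: pick k with probability probs[k]
    u, cum = rng.random(), 0.0
    for k, p in enumerate(probs):
        cum += p
        if u <= cum:
            return k, probs
    return len(probs) - 1, probs

k, probs = sample_assignment([math.log(0.8), math.log(0.2)], [0.5, 0.5])
print(probs)   # posterior probabilities [0.8, 0.2]
```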

Similarly, a natural estimate of the allele frequencies in subpopulation k is:

  frequency of allele j at locus l in subpopulation k
    = (# copies of allele j at locus l in individuals from k) / (2 × # individuals from subpopulation k)

But because we are Bayesian, and doing MCMC, we sample from a posterior distribution for the frequency that also depends on the prior.
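With a Dirichlet prior on the allele frequencies, the posterior given the current assignments is again Dirichlet, with the observed allele counts added to the prior parameters; a Dirichlet draw can be built from gamma variates. A minimal sketch, assuming a symmetric Dirichlet(alpha) prior (the function name and the default alpha = 1.0 are hypothetical choices for illustration):

```python
import random

def sample_allele_freqs(counts, alpha=1.0, rng=random):
    """Sample allele frequencies at one locus in one subpopulation from the
    posterior Dirichlet(alpha + counts), where `counts` holds the number of
    observed copies of each allele among individuals currently assigned to
    the subpopulation. A Dirichlet draw is obtained by drawing independent
    Gamma(alpha + count, 1) variates and normalizing them to sum to 1.
    """
    draws = [rng.gammavariate(alpha + c, 1.0) for c in counts]
    total = sum(draws)
    return [d / total for d in draws]

f = sample_allele_freqs([10, 2, 0])   # counts for three alleles at one locus
print(f)                              # three frequencies summing to 1
```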

MCMC algorithm (for fixed K)
Start at random initial values Z^(0) for the population assignments, then iterate the following steps for n = 1, 2, ...:
  Step 1: Sample P^(n) from Pr(P | Z^(n-1), G).
  Step 2: Sample Z^(n) from Pr(Z | P^(n), G).
For large n, (P^(n), Z^(n)) will converge to the appropriate joint posterior distribution. Estimation of K is performed separately (and approximately).
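The two alternating steps can be combined into a small self-contained Gibbs sampler. This is a minimal sketch, not the STRUCTURE implementation: it assumes a no-admixture model, a uniform prior on Z, a symmetric Dirichlet(alpha) prior on allele frequencies, and integer-coded alleles; the function name `structure_mcmc` and all defaults are hypothetical.

```python
import math
import random

def structure_mcmc(genotypes, K, n_iter=300, alpha=1.0, seed=0):
    """Minimal no-admixture Gibbs sampler for fixed K.
    genotypes: one list per individual of (a1, a2) pairs per locus,
               alleles coded as integers 0..(n_alleles-1).
    Alternates Step 1 (sample P | Z, G) and Step 2 (sample Z | P, G);
    returns the final assignment vector Z.
    """
    rng = random.Random(seed)
    n_loci = len(genotypes[0])
    n_alleles = [max(max(g[l]) for g in genotypes) + 1 for l in range(n_loci)]
    Z = [rng.randrange(K) for _ in genotypes]          # random initial assignments

    for _ in range(n_iter):
        # Step 1: P ~ Pr(P | Z, G): Dirichlet posterior from current allele counts
        P = []
        for k in range(K):
            Pk = []
            for l in range(n_loci):
                counts = [alpha] * n_alleles[l]
                for g, z in zip(genotypes, Z):
                    if z == k:
                        counts[g[l][0]] += 1
                        counts[g[l][1]] += 1
                draws = [rng.gammavariate(c, 1.0) for c in counts]
                s = sum(draws)
                Pk.append([d / s for d in draws])
            P.append(Pk)
        # Step 2: Z ~ Pr(Z | P, G): reassign each individual by posterior probability
        for i, g in enumerate(genotypes):
            logw = [sum(math.log(P[k][l][g[l][0]]) + math.log(P[k][l][g[l][1]])
                        for l in range(n_loci)) for k in range(K)]
            m = max(logw)
            w = [math.exp(x - m) for x in logw]
            u, cum = rng.random() * sum(w), 0.0
            for k, wk in enumerate(w):
                cum += wk
                if u <= cum:
                    Z[i] = k
                    break
    return Z
```

On toy data with two clearly distinct groups (e.g. three individuals homozygous for allele 0 at every locus and three homozygous for allele 1), the sampler separates the groups within a few sweeps, up to arbitrary labeling of the clusters.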

Example: Taita thrush data
- three main sampling locations in Kenya
- low migration rates (radio-tagging study)
- 155 individuals, genotyped at 7 microsatellite loci

Neighbor-joining tree of data

Since 2000:
- Model more features of heredity
- Infer more complex histories
- Faster algorithms
- Better data summaries
- More data