Inferring human demographic history from DNA sequence data Apr. 28, 2009 J. Wall Institute for Human Genetics, UCSF.


1 Inferring human demographic history from DNA sequence data Apr. 28, 2009 J. Wall Institute for Human Genetics, UCSF

2 Standard model of human evolution

3 Standard model of human evolution (Origin and spread of genus Homo) 2 – 2.5 Mya

4 Standard model of human evolution (Origin and spread of genus Homo) 1.6 – 1.8 Mya ? ?

5 Standard model of human evolution (Origin and spread of genus Homo) 0.8 – 1.0 Mya

6 Standard model of human evolution Origin and spread of ‘modern’ humans 150 – 200 Kya

7 Standard model of human evolution Origin and spread of ‘modern’ humans ~ 100 Kya

8 Standard model of human evolution Origin and spread of ‘modern’ humans 40 – 60 Kya

9 Standard model of human evolution Origin and spread of ‘modern’ humans 15 – 30 Kya

10 Estimating demographic parameters How can we quantify this qualitative scenario into an explicit model? How can we choose a model that is both biologically feasible as well as computationally tractable? How do we estimate parameters and quantify uncertainty in parameter estimates?

11 Estimating demographic parameters Calculating full likelihoods (under realistic models including recombination) is computationally infeasible. So, compromises need to be made if one is interested in parameter estimation.

12 African populations 10 populations 229 individuals

13 African populations San (bushmen) Biaka (pygmies) Mandenka (bantu) 61 autosomal loci ~ 350 Kb sequence data

14 A simple model of African population history [Figure: two-population model with split time T, migration rate m, and growth rates g1 and g2, relating Mandenka to Biaka (or San)]

15 Estimation method We use a composite-likelihood method (cf. Plagnol and Wall 2006) that uses information from the joint frequency spectrum such as: numbers of segregating sites; numbers of shared and fixed differences; Tajima’s D; F_ST; Fu and Li’s D*.
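As an illustration of one of the summary statistics listed above, the following is a minimal sketch of Tajima's D for a single sample of aligned haplotypes, using the standard textbook formula. The function name `tajimas_d` and the toy haplotypes are assumptions made for this example; they do not come from the talk or from the cited method.

```python
import math
from itertools import combinations
from typing import Dict, List


def tajimas_d(haplotypes: List[str]) -> Dict[str, float]:
    """Return S, pi, Watterson's theta and Tajima's D for aligned haplotypes."""
    n = len(haplotypes)
    length = len(haplotypes[0])

    # S: number of segregating (polymorphic) sites in the sample.
    seg = sum(1 for site in range(length)
              if len({h[site] for h in haplotypes}) > 1)

    # pi: mean number of pairwise differences between sequences.
    pairs = list(combinations(haplotypes, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)

    # Constants from Tajima (1989).
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    theta_w = seg / a1
    if seg == 0:
        d = float("nan")  # D is undefined without segregating sites
    else:
        d = (pi - theta_w) / math.sqrt(e1 * seg + e2 * seg * (seg - 1))
    return {"S": seg, "pi": pi, "theta_w": theta_w, "D": d}


# Toy sample (hypothetical data): four haplotypes, four sites.
stats = tajimas_d(["AAAA", "AAAT", "AATT", "ATTT"])
print(stats)
```

Negative values of D indicate an excess of rare variants, which is one reason the statistic is informative about population growth.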

16 Estimation method We use a composite-likelihood method (cf. Plagnol and Wall 2006) that uses information from the joint frequency spectrum such as: numbers of segregating sites; numbers of shared and fixed differences; Tajima’s D; F_ST; Fu and Li’s D*.

17 Estimating likelihoods Pop1 Pop2

18 Estimating likelihoods Pop1 Pop2 Pop 1 private polymorphisms

19 Estimating likelihoods Pop1 Pop2 Pop 1 private polymorphisms Pop 2 private polymorphisms

20 Estimating likelihoods Pop1 Pop2 Pop 1 private polymorphisms Pop 2 private polymorphisms Shared polymorphisms
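The partition of polymorphisms shown on these slides can be sketched in a few lines. This is a minimal example, assuming biallelic sites and aligned haplotype strings. The function name `classify_sites` and the toy data are hypothetical, and the fixed-difference class is included because the estimation-method slide lists it alongside shared polymorphisms.

```python
from typing import Dict, List


def classify_sites(pop1: List[str], pop2: List[str]) -> Dict[str, int]:
    """Count private, shared and fixed-difference sites for two samples."""
    counts = {"private_pop1": 0, "private_pop2": 0,
              "shared": 0, "fixed_difference": 0}
    for site in range(len(pop1[0])):
        alleles1 = {h[site] for h in pop1}
        alleles2 = {h[site] for h in pop2}
        poly1 = len(alleles1) > 1   # polymorphic within population 1
        poly2 = len(alleles2) > 1   # polymorphic within population 2
        if poly1 and poly2:
            counts["shared"] += 1
        elif poly1:
            counts["private_pop1"] += 1
        elif poly2:
            counts["private_pop2"] += 1
        elif alleles1 != alleles2:
            # Monomorphic in both samples, but for different alleles.
            counts["fixed_difference"] += 1
    return counts


# Toy data (hypothetical): three haplotypes per population, five sites.
pop1 = ["AACAC", "AGCAC", "AGTAC"]
pop2 = ["TACGC", "TACGT", "TGCGC"]
site_counts = classify_sites(pop1, pop2)
print(site_counts)
```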

21 Estimation method We use a composite-likelihood method (cf. Plagnol and Wall 2006) that uses information from the joint frequency spectrum such as: numbers of segregating sites; numbers of shared and fixed differences; Tajima’s D; F_ST; Fu and Li’s D*.

22 Estimating likelihoods We assume these other statistics are multivariate normal. Then, we run simulations to estimate the means and the covariance matrix. This accounts (in a crude way) for dependencies across different summary statistics.
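The multivariate-normal step described here can be sketched as follows: simulate the summary statistics many times, estimate their mean vector and covariance matrix, and evaluate the normal log-density at the observed values. This is only an illustration in plain Python. `simulate_stats` is a hypothetical stand-in for a real coalescent simulator, and the "observed" vector is made up.

```python
import math
import random
from typing import List, Sequence, Tuple


def mean_and_cov(samples: Sequence[Sequence[float]]) -> Tuple[List[float], List[List[float]]]:
    """Sample mean vector and unbiased covariance matrix."""
    n, d = len(samples), len(samples[0])
    mean = [sum(s[j] for s in samples) / n for j in range(d)]
    cov = [[sum((s[a] - mean[a]) * (s[b] - mean[b]) for s in samples) / (n - 1)
            for b in range(d)] for a in range(d)]
    return mean, cov


def cholesky(m: Sequence[Sequence[float]]) -> List[List[float]]:
    """Lower-triangular L with L * L^T = m (m must be positive definite)."""
    d = len(m)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            acc = m[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(acc) if i == j else acc / low[j][j]
    return low


def mvn_logpdf(x: Sequence[float], mean: Sequence[float],
               cov: Sequence[Sequence[float]]) -> float:
    """Log-density of a multivariate normal at x."""
    d = len(x)
    low = cholesky(cov)
    diff = [x[i] - mean[i] for i in range(d)]
    z: List[float] = []
    for i in range(d):  # forward substitution: solve L z = diff
        z.append((diff[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i])
    quad = sum(v * v for v in z)
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(d))
    return -0.5 * (d * math.log(2.0 * math.pi) + log_det + quad)


def simulate_stats(rng: random.Random) -> List[float]:
    """Hypothetical simulator: two correlated summary statistics."""
    common = rng.gauss(0.0, 1.0)
    return [common + rng.gauss(0.0, 0.5), 0.8 * common + rng.gauss(0.0, 0.5)]


rng = random.Random(1)
sims = [simulate_stats(rng) for _ in range(2000)]
mean, cov = mean_and_cov(sims)
print(mvn_logpdf([0.1, -0.2], mean, cov))  # approximate log-likelihood
```

In a real analysis the simulation, mean and covariance would be recomputed for each candidate parameter set, so that the log-density becomes a function of the demographic parameters.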

23 Composite likelihood We form a composite likelihood by assuming these two classes of summary statistics are independent of each other. We estimate the (composite) likelihood over a grid of values of g1, g2, T and M and tabulate the MLE. We also use standard asymptotic assumptions to estimate confidence intervals.
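A grid search with asymptotic intervals might look like the sketch below. It assumes the usual chi-square approximation, under which a 95% interval for one parameter keeps the grid values whose best log-likelihood lies within 1.92 units of the maximum. The toy log-likelihood surface, the grid values and the function names are invented for illustration and are not the talk's actual likelihood.

```python
import itertools
from typing import Callable, Dict, Iterable, List, Tuple

Surface = Dict[Tuple[float, ...], float]


def grid_search(loglik: Callable[[Dict[str, float]], float],
                grids: Dict[str, Iterable[float]]) -> Tuple[List[str], Surface, Tuple[float, ...]]:
    """Evaluate loglik at every grid point and return the maximiser."""
    names = list(grids)
    surface: Surface = {}
    for values in itertools.product(*(grids[name] for name in names)):
        surface[values] = loglik(dict(zip(names, values)))
    best_values = max(surface, key=surface.get)
    return names, surface, best_values


def likelihood_interval(names: List[str], surface: Surface,
                        param: str, drop: float = 1.92) -> Tuple[float, float]:
    """Range of `param` whose profile log-likelihood is within `drop` of the max."""
    best_ll = max(surface.values())
    idx = names.index(param)
    kept = [values[idx] for values, ll in surface.items() if best_ll - ll <= drop]
    return min(kept), max(kept)


def toy_loglik(p: Dict[str, float]) -> float:
    """Hypothetical quadratic surface standing in for the composite likelihood."""
    return -((p["T"] - 450) ** 2) / 5000.0 - ((p["M"] - 10) ** 2) / 2.0


grids = {"T": range(300, 601, 50), "M": range(8, 13)}
names, surface, best = grid_search(toy_loglik, grids)
print(dict(zip(names, best)))
print(likelihood_interval(names, surface, "T"))
print(likelihood_interval(names, surface, "M"))
```

The resolution of the interval is limited by the grid spacing, so a coarse grid yields coarse interval endpoints.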

24 Estimates (with 95% CI’s)

Parameter     Man-Bia            Man-San
g1 (000’s)    0 (0 – 3.8)        0 (0 – 3.8)
g2 (000’s)    4 (0 – 7.9)        2 (0 – 11)
T (000’s)     450 (300 – 640)    100 (77 – 550)
M (= 4Nm)     10 (8.4 – 12)      3 (2.2 – 4)

25 Fit of the null model How well does the demographic null model fit the patterns of genetic variation found in the actual data?

26 Fit of the null model How well does the demographic null model fit the patterns of genetic variation found in the actual data? Quite well. The model accurately reproduces both parameters used in the original fitting (e.g., Tajima’s D in each population) as well as other aspects of the data (e.g., estimates of ρ = 4Nr)

27 Estimates (with 95% CI’s)

Parameter     Man-Bia            Man-San
g1 (000’s)    0 (0 – 3.8)        0 (0 – 3.8)
g2 (000’s)    4 (0 – 7.9)        2 (0 – 11)
T (000’s)     450 (300 – 640)    100 (77 – 550)
M (= 4Nm)     10 (8.4 – 12)      3 (2.2 – 4)

28 Population growth [Figure: population size plotted against time]

29 Population growth [Figure: population size plotted against time, annotated “spread of agriculture and animal husbandry?”]

30 Estimates (with 95% CI’s)

Parameter     Man-Bia            Man-San
g1 (000’s)    0 (0 – 3.8)        0 (0 – 3.8)
g2 (000’s)    4 (0 – 7.9)        2 (0 – 11)
T (000’s)     450 (300 – 640)    100 (77 – 550)
M (= 4Nm)     10 (8.4 – 12)      3 (2.2 – 4)

31 Ancestral structure in Africa At face value, these results suggest that population structure within Africa is old, and predates the migration of modern humans out of Africa. Is there any evidence for additional (unknown) ancient population structure within Africa?

32 Model of ancestral structure [Figure: the two-population model (T, m, g1, g2; Mandenka and Biaka (or San)) with an added archaic human population]

33 Standard model of human evolution Origin and spread of ‘modern’ humans ~ 100 Kya

34 Admixture mapping Modern human DNA / Neandertal DNA

35 Admixture mapping Modern human DNA / Neandertal DNA

36 Admixture mapping Modern human DNA / Neandertal DNA

37 Admixture mapping Modern human DNA / Neandertal DNA

38 Admixture mapping Modern human DNA / Neandertal DNA. Orange chunks are ~10 – 100 Kb in length

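39 Genealogy with archaic ancestry [Figure: genealogy drawn against a time axis ending at the present; lineages labelled Modern humans and Archaic humans]

40 Genealogy without archaic ancestry [Figure: genealogy drawn against a time axis ending at the present; lineages labelled Modern humans and Archaic humans]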

41 Our main questions What pattern does archaic ancestry produce in DNA sequence polymorphism data (from extant humans)? How can we use data to –estimate the contribution of archaic humans to the modern gene pool (c)? –test whether c > 0?

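42 Genealogy with archaic ancestry (Mutations added) [Figure: genealogy drawn against a time axis ending at the present; lineages labelled Modern humans and Archaic humans]

43 Genealogy with archaic ancestry (Mutations added) [Figure: genealogy drawn against a time axis ending at the present; lineages labelled Modern humans and Archaic humans]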

44 Patterns in DNA sequence data

Sequence 1   A T C C A C A G C T G
Sequence 2   A G C C A C G G C T G
Sequence 3   T G C G G T A A C C T
Sequence 4   A G C C A C A G C T G
Sequence 5   T G T G G T A A C C T
Sequence 6   A G C C A T A G A T G
Sequence 7   A G C C A T A G A T G

45 Patterns in DNA sequence data

Sequence 1   A T C C A C A G C T G
Sequence 2   A G C C A C G G C T G
Sequence 3   T G C G G T A A C C T
Sequence 4   A G C C A C A G C T G
Sequence 5   T G T G G T A A C C T
Sequence 6   A G C C A T A G A T G
Sequence 7   A G C C A T A G A T G

46 Patterns in DNA sequence data

Sequence 1   A T C C A C A G C T G
Sequence 2   A G C C A C G G C T G
Sequence 3   T G C G G T A A C C T
Sequence 4   A G C C A C A G C T G
Sequence 5   T G T G G T A A C C T
Sequence 6   A G C C A T A G A T G
Sequence 7   A G C C A T A G A T G

We call the sites in red congruent sites – these are sites inferred to be on the same branch of an unrooted tree
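One way to find congruent sites is to group biallelic sites by the split of the sample that they induce: two sites fall on the same branch of an unrooted tree when they divide the sequences into the same two groups. The sketch below applies this to the seven sequences on the slide. The function name `congruent_groups` is hypothetical, and because the colour coding is lost in this transcript, whether the largest group matches the sites highlighted in red is an assumption.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List


def congruent_groups(seqs: List[str]) -> Dict[FrozenSet[int], List[int]]:
    """Group biallelic sites (0-based) by the bipartition of sequences they induce."""
    groups: Dict[FrozenSet[int], List[int]] = defaultdict(list)
    for site in range(len(seqs[0])):
        column = [s[site] for s in seqs]
        if len(set(column)) != 2:
            continue  # skip monomorphic or multi-allelic sites
        # Represent the split by the sequences sharing sequence 0's allele,
        # so the same unrooted bipartition always maps to the same key.
        split = frozenset(i for i, a in enumerate(column) if a == column[0])
        groups[split].append(site)
    return dict(groups)


sequences = [
    "ATCCACAGCTG",  # Sequence 1
    "AGCCACGGCTG",  # Sequence 2
    "TGCGGTAACCT",  # Sequence 3
    "AGCCACAGCTG",  # Sequence 4
    "TGTGGTAACCT",  # Sequence 5
    "AGCCATAGATG",  # Sequence 6
    "AGCCATAGATG",  # Sequence 7
]
groups = congruent_groups(sequences)
largest = max(groups.values(), key=len)
print(largest)  # 0-based positions of the largest set of congruent sites
```

In this sample the largest congruent set consists of six sites that all separate sequences 3 and 5 from the other five sequences.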

47 Linkage disequilibrium (LD) LD is the nonrandom association of alleles at different sites.

Low LD                  High LD
AC                      AC
AT                      AC
AC                      AC
AT                      AC
GC                      GT
GT                      GT
GC                      GT
GT                      GT
(High recombination)    (Low recombination)
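The two haplotype sets on this slide can be compared with r², a common measure of pairwise LD (the talk itself does not say which measure it has in mind, so the choice of r² here is an assumption). A minimal sketch for two biallelic, polymorphic sites:

```python
from typing import List


def r_squared(haplotypes: List[str]) -> float:
    """r^2 between the two sites of two-site haplotypes (both sites must be polymorphic)."""
    n = len(haplotypes)
    a, b = haplotypes[0][0], haplotypes[0][1]   # reference allele at each site
    p_a = sum(h[0] == a for h in haplotypes) / n
    p_b = sum(h[1] == b for h in haplotypes) / n
    p_ab = sum(h[0] == a and h[1] == b for h in haplotypes) / n
    d = p_ab - p_a * p_b                         # coefficient of disequilibrium D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


low_ld = ["AC", "AT", "AC", "AT", "GC", "GT", "GC", "GT"]
high_ld = ["AC", "AC", "AC", "AC", "GT", "GT", "GT", "GT"]
print(r_squared(low_ld))   # 0.0: alleles at the two sites are independent
print(r_squared(high_ld))  # 1.0: alleles at the two sites are perfectly associated
```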

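48 Measuring ‘congruence’ To measure the level of ‘congruence’ in SNP data from larger regions we define a score function $S^* = \max_{J} S(J)$, where the maximum is taken over subsets $J = \{i_1, \dots, i_k\}$ of the SNPs, $S(i_1, \dots, i_k) = \sum_{j=1}^{k-1} S(i_j, i_{j+1})$, and $S(i_j, i_{j+1})$ is a function of both congruence (or near congruence) and physical distance between $i_j$ and $i_{j+1}$.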
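Assuming S* is the maximum, over ordered subsets of SNPs, of the summed scores of consecutive pairs (as in Plagnol and Wall 2006, which the talk cites), it can be computed with a simple dynamic programme. The pairwise scoring function below is a hypothetical illustration: it rewards perfectly congruent pairs by their physical distance and penalises near-congruent ones, with constants chosen only for the example.

```python
import math
from typing import Callable, List, Sequence, Tuple


def s_star(n_sites: int, pair_score: Callable[[int, int], float]) -> float:
    """Max over chains i_1 < ... < i_k of the sum of pair_score(i_j, i_{j+1})."""
    best_ending_at = [0.0] * n_sites   # a single-site chain scores 0
    for j in range(n_sites):
        for i in range(j):
            candidate = best_ending_at[i] + pair_score(i, j)
            if candidate > best_ending_at[j]:
                best_ending_at[j] = candidate
    return max(best_ending_at, default=0.0)


def make_pair_score(positions: Sequence[int],
                    columns: Sequence[Tuple[int, ...]]) -> Callable[[int, int], float]:
    """Hypothetical score: reward congruent pairs, penalise mismatching ones."""
    def score(i: int, j: int) -> float:
        mismatches = sum(a != b for a, b in zip(columns[i], columns[j]))
        if mismatches == 0:
            return 5000.0 + (positions[j] - positions[i])  # congruent pair
        if mismatches <= 5:
            return -10000.0                                # near-congruent pair
        return -math.inf                                   # incompatible pair
    return score


# Toy data (hypothetical): 4 SNPs typed in 4 individuals (0/1 allele codes).
positions: List[int] = [100, 400, 900, 1500]
columns = [(1, 1, 0, 0), (1, 1, 0, 0), (0, 1, 0, 1), (1, 1, 0, 0)]
value = s_star(len(positions), make_pair_score(positions, columns))
print(value)  # best chain is SNPs 0 -> 1 -> 3
```

The dynamic programme runs in time quadratic in the number of SNPs, which avoids enumerating all subsets explicitly.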

49 An example

50 An example (CHRNA4)

51 How often is S* from simulations greater than or equal to the S* value from the actual data?

52 An example (CHRNA4) How often is S* from simulations greater than or equal to the S* value from the actual data? p = 0.025

53 S* is sensitive to ancient admixture

54 General approach We use the model parameters estimated before (growth rates, migration rate, split time) as a demographic null model. Is our null model sufficient to explain the patterns of LD in the data? We test this by comparing the observed S* values with the distribution of S* values calculated from data simulated under the null model.
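The comparison described here amounts to an empirical p-value: the fraction of simulated S* values that are at least as large as the observed one. A minimal sketch, with made-up numbers standing in for simulated and observed S* values:

```python
from typing import Sequence


def empirical_p_value(observed: float, simulated: Sequence[float]) -> float:
    """Fraction of simulated statistics greater than or equal to the observed one."""
    hits = sum(1 for s in simulated if s >= observed)
    return hits / len(simulated)


# Hypothetical values: 40 simulated statistics, one of which reaches the observed value.
simulated_values = list(range(40))
p = empirical_p_value(39, simulated_values)
print(p)
```

The smallest nonzero p-value obtainable this way is one over the number of simulations, so small p-values require many simulated data sets.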

55 Distribution of p-values (Mandenka and San) [Figure: histogram of frequency against p-value]

56 Distribution of p-values (Mandenka and San) [Figure: histogram of frequency against p-value] Global p-value: 2.5 × 10^-5

57 Estimating ancient admixture rates The global p-values for S* are highly significant in every population that we’ve studied! If we estimate the ancient admixture rate in our (composite) likelihood framework, we can reject a model with no ancient admixture for all populations studied.

58 A region on chromosome 4

59 19 mutations (from 6 Kb of sequence) separate 3 Biaka sequences from all of the other sequences in our sample. Simulations suggest this cannot be caused by recent population structure (p < 10^-3). This corresponds to isolation lasting ~1.5 million years!

60 Possible explanations Isolation followed by later mixing is a recurrent feature of human population history. Mixing between ‘archaic’ humans and modern humans happened at least once prior to the exodus of modern humans out of Africa. Some other feature of population structure is unaccounted for in our simple models.

61 Acknowledgments Collaborators: Mike Hammer (U. of Arizona), Vincent Plagnol (Cambridge University). Samples: Foundation Jean Dausset (CEPH), Y chromosome consortium (YCC). Funding: National Science Foundation, National Institutes of Health.

