Inferring human demographic history from DNA sequence data Apr. 28, 2009 J. Wall Institute for Human Genetics, UCSF.

Presentation transcript:

Inferring human demographic history from DNA sequence data Apr. 28, 2009 J. Wall Institute for Human Genetics, UCSF

Standard model of human evolution

Standard model of human evolution (Origin and spread of genus Homo) 2 – 2.5 Mya

Standard model of human evolution (Origin and spread of genus Homo) 1.6 – 1.8 Mya ? ?

Standard model of human evolution (Origin and spread of genus Homo) 0.8 – 1.0 Mya

Standard model of human evolution Origin and spread of ‘modern’ humans 150 – 200 Kya

Standard model of human evolution Origin and spread of ‘modern’ humans ~ 100 Kya

Standard model of human evolution Origin and spread of ‘modern’ humans 40 – 60 Kya

Standard model of human evolution Origin and spread of ‘modern’ humans 15 – 30 Kya

Estimating demographic parameters How can we turn this qualitative scenario into an explicit model? How can we choose a model that is both biologically feasible and computationally tractable? How do we estimate parameters and quantify the uncertainty in those estimates?

Estimating demographic parameters Calculating full likelihoods (under realistic models including recombination) is computationally infeasible, so compromises need to be made if one is interested in parameter estimation.

African populations 10 populations, 229 individuals

African populations San (bushmen), Biaka (pygmies), Mandenka (bantu). 61 autosomal loci, ~350 Kb sequence data

A simple model of African population history [Figure: two-population model for Mandenka and Biaka (or San), labelled with the parameters T, m, g1 and g2.]

Estimation method We use a composite-likelihood method (cf. Plagnol and Wall 2006) that uses information from the joint frequency spectrum such as: numbers of segregating sites; numbers of shared and fixed differences; Tajima’s D; F_ST; Fu and Li’s D*.
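
Two of the statistics listed above, the number of segregating sites and Tajima's D, can be computed directly from an alignment. What follows is a minimal Python sketch and not the authors' code: the function name `tajimas_d` and the toy alignments are illustrative assumptions, and the constants follow the standard definition of Tajima (1989).

```python
from itertools import combinations
from math import sqrt


def tajimas_d(seqs):
    """Tajima's D for a list of equal-length aligned sequences (strings)."""
    n = len(seqs)
    # Segregating sites: alignment columns carrying more than one allele.
    seg = sum(1 for col in zip(*seqs) if len(set(col)) > 1)
    if seg == 0:
        return 0.0  # convention chosen here; D is undefined without variation
    # pi: mean number of differences over all pairs of sequences.
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # Difference between the two estimators of theta, scaled by its std. dev.
    return (pi - seg / a1) / sqrt(e1 * seg + e2 * seg * (seg - 1))


# Toy alignments (made up for illustration, not data from the study).
only_singletons = ["AAAA", "TAAA", "ATAA", "AATA", "AAAT"]
balanced = ["AA", "AT", "TA", "TT"]
print(tajimas_d(only_singletons), tajimas_d(balanced))
```

An excess of rare variants pushes D below zero, while an excess of intermediate-frequency variants pushes it above zero, which is why the statistic carries information about growth and structure.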

Estimating likelihoods [Figure: genealogy of samples from Pop1 and Pop2, marking Pop 1 private polymorphisms, Pop 2 private polymorphisms and shared polymorphisms.]
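
The site classes named on this slide can be tallied from two aligned samples. The sketch below is an illustration under stated assumptions rather than the authors' implementation: `classify_sites` is a hypothetical helper, sites are assumed biallelic, and a site counts as shared whenever it is polymorphic in both samples.

```python
def classify_sites(pop1, pop2):
    """Count private, shared and fixed sites for two aligned samples."""
    counts = {"private1": 0, "private2": 0, "shared": 0, "fixed": 0}
    for col1, col2 in zip(zip(*pop1), zip(*pop2)):
        alleles1, alleles2 = set(col1), set(col2)
        poly1, poly2 = len(alleles1) > 1, len(alleles2) > 1
        if poly1 and poly2:
            counts["shared"] += 1      # segregating in both samples
        elif poly1:
            counts["private1"] += 1    # segregating only in sample 1
        elif poly2:
            counts["private2"] += 1    # segregating only in sample 2
        elif alleles1 != alleles2:
            counts["fixed"] += 1       # monomorphic in each, but different
    return counts


# Toy samples (made up): five aligned sites, two sequences per population.
pop1 = ["AAAAC", "ATAAG"]
pop2 = ["AAGTC", "AAATG"]
print(classify_sites(pop1, pop2))
```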

Estimation method We use a composite-likelihood method (cf. Plagnol and Wall 2006) that uses information from the joint frequency spectrum such as: numbers of segregating sites; numbers of shared and fixed differences; Tajima’s D; F_ST; Fu and Li’s D*.

Estimating likelihoods We assume these other statistics are multivariate normal. Then, we run simulations to estimate the means and the covariance matrix. This accounts (in a crude way) for dependencies across different summary statistics.
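
One way to write the approximation described above is as a Gaussian log-likelihood whose mean and covariance are estimated from simulations. The notation is an assumption introduced here, not taken from the slides: $s_{\mathrm{obs}}$ is the vector of $k$ observed summary statistics, $\theta$ is a candidate parameter set, and $s^{(1)}, \dots, s^{(R)}$ are the statistics from $R$ simulations run under $\theta$.

```latex
\hat{\mu}_\theta = \frac{1}{R} \sum_{r=1}^{R} s^{(r)}, \qquad
\hat{\Sigma}_\theta = \frac{1}{R-1} \sum_{r=1}^{R}
  \bigl(s^{(r)} - \hat{\mu}_\theta\bigr)\bigl(s^{(r)} - \hat{\mu}_\theta\bigr)^{\top}

\log L_{\mathrm{MVN}}(\theta) = -\tfrac{1}{2} \Bigl[ k \log(2\pi)
  + \log \bigl|\hat{\Sigma}_\theta\bigr|
  + \bigl(s_{\mathrm{obs}} - \hat{\mu}_\theta\bigr)^{\top}
    \hat{\Sigma}_\theta^{-1}
    \bigl(s_{\mathrm{obs}} - \hat{\mu}_\theta\bigr) \Bigr]
```

The off-diagonal entries of $\hat{\Sigma}_\theta$ are what carry the dependencies between summary statistics.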

Composite likelihood We form a composite likelihood by assuming these two classes of summary statistics are independent of each other. We estimate the composite likelihood over a grid of values of g1, g2, T and M and tabulate the MLE. We also use standard asymptotic assumptions to estimate confidence intervals.
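
The grid search and interval construction can be sketched for a single parameter. Everything specific here is an assumption made for illustration: the two component log-likelihoods are made-up quadratic functions, the helper `grid_mle_with_ci` is hypothetical, and "standard asymptotic assumptions" is read as the usual rule that a 95% interval contains every value whose log-likelihood lies within 1.92 (half the chi-square 0.95 quantile with one degree of freedom) of the maximum.

```python
def loglik_site_counts(x):
    # Made-up stand-in for the log-likelihood of the first class of statistics.
    return -0.25 * ((x - 50) / 10) ** 2


def loglik_other_stats(x):
    # Made-up stand-in for the log-likelihood of the second class of statistics.
    return -0.25 * ((x - 50) / 10) ** 2


def composite_loglik(x):
    # Treating the two classes as independent means their likelihoods
    # multiply, so their log-likelihoods add.
    return loglik_site_counts(x) + loglik_other_stats(x)


def grid_mle_with_ci(loglik, grid, drop=1.92):
    """Return the grid MLE and the range of grid values within `drop` of it."""
    values = [(loglik(x), x) for x in grid]
    best_ll, best_x = max(values)
    inside = [x for ll, x in values if best_ll - ll <= drop]
    return best_x, (min(inside), max(inside))


mle, ci = grid_mle_with_ci(composite_loglik, range(0, 101))
print(mle, ci)
```

With four parameters the same idea applies to a four-dimensional grid, with intervals for one parameter read from the profile obtained by maximising over the other three.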

Estimates (with 95% CIs)
Parameter     Man-Bia            Man-San
g1 (000’s)    0 (0 – 3.8)        0 (0 – 3.8)
g2 (000’s)    4 (0 – 7.9)        2 (0 – 11)
T (000’s)     450 (300 – 640)    100 (77 – 550)
M (= 4Nm)     10 (8.4 – 12)      3 (2.2 – 4)

Fit of the null model How well does the demographic null model fit the patterns of genetic variation found in the actual data? Quite well. The model accurately reproduces both the summary statistics used in the original fitting (e.g., Tajima’s D in each population) and other aspects of the data (e.g., estimates of ρ = 4Nr).

Population growth [Figure: population size plotted against time.] Spread of agriculture and animal husbandry?

Ancestral structure in Africa At face value, these results suggest that population structure within Africa is old, and predates the migration of modern humans out of Africa. Is there any evidence for additional (unknown) ancient population structure within Africa?

Model of ancestral structure [Figure: the two-population model for Mandenka and Biaka (or San), with parameters T, m, g1 and g2, extended with an archaic human population.]

Standard model of human evolution Origin and spread of ‘modern’ humans ~ 100 Kya

Admixture mapping [Figure: chromosomes made up of modern human DNA and Neandertal DNA.] Orange chunks are ~10 – 100 Kb in length

Genealogy with archaic ancestry [Figure: genealogy of modern and archaic human lineages, with a time axis running to the present.]

Genealogy without archaic ancestry [Figure: genealogy of modern and archaic human lineages, with a time axis running to the present.]

Our main questions What pattern does archaic ancestry produce in DNA sequence polymorphism data (from extant humans)? How can we use data to (1) estimate the contribution of archaic humans to the modern gene pool (c), and (2) test whether c > 0?

Genealogy with archaic ancestry (Mutations added) [Figure: genealogy of modern and archaic human lineages with mutations placed on the branches, and a time axis running to the present.]

Patterns in DNA sequence data
Sequence 1  A T C C A C A G C T G
Sequence 2  A G C C A C G G C T G
Sequence 3  T G C G G T A A C C T
Sequence 4  A G C C A C A G C T G
Sequence 5  T G T G G T A A C C T
Sequence 6  A G C C A T A G A T G
Sequence 7  A G C C A T A G A T G
We call the sites in red congruent sites: these are sites inferred to be on the same branch of an unrooted tree.
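
The red highlighting does not survive in a text transcript, but sites that fall on the same branch of an unrooted tree split the sample into the same two groups, so candidates can be found by grouping sites by that split. The sketch below is an illustration, not the authors' procedure; `partition_groups` is a hypothetical helper, and sequences and sites are numbered from 1.

```python
from collections import defaultdict

# The seven sequences shown on the slide, with spaces removed.
seqs = [
    "ATCCACAGCTG",
    "AGCCACGGCTG",
    "TGCGGTAACCT",
    "AGCCACAGCTG",
    "TGTGGTAACCT",
    "AGCCATAGATG",
    "AGCCATAGATG",
]


def partition_groups(seqs):
    """Map each bipartition of the sample to the sites that induce it."""
    everyone = frozenset(range(1, len(seqs) + 1))
    groups = defaultdict(list)
    for site, col in enumerate(zip(*seqs), start=1):
        if len(set(col)) != 2:
            continue  # skip monomorphic and non-biallelic sites
        side = frozenset(i for i, a in enumerate(col, start=1) if a != col[0])
        other = everyone - side
        # Name the split by its smaller side so both labelings coincide.
        key = min(side, other, key=lambda s: (len(s), sorted(s)))
        groups[key].append(site)
    return groups


groups = partition_groups(seqs)
for split, sites in sorted(groups.items(), key=lambda kv: -len(kv[1])):
    print(sorted(split), sites)
```

On these data, sites 1, 4, 5, 8, 10 and 11 all separate sequences 3 and 5 from the rest, which makes them plausible candidates for the highlighted sites.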

Linkage disequilibrium (LD) LD is the nonrandom association of alleles at different sites.
Low LD (high recombination):  AC AT AC AT GC GT GC GT
High LD (low recombination):  AC AC AC AC GT GT GT GT
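
The contrast between the two haplotype sets can be made numerical. The slide does not name a measure; the sketch below uses r², one common LD statistic, as an assumption, and `r_squared` is a hypothetical helper that expects two-site haplotypes with both sites polymorphic.

```python
def r_squared(haps):
    """r^2 between the two sites of a list of two-character haplotypes."""
    n = len(haps)
    a, b = haps[0][0], haps[0][1]          # reference allele at each site
    p_a = sum(h[0] == a for h in haps) / n
    p_b = sum(h[1] == b for h in haps) / n
    p_ab = sum(h[0] == a and h[1] == b for h in haps) / n
    d = p_ab - p_a * p_b                   # coefficient of disequilibrium
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


# The two haplotype sets from the slide.
low_ld = ["AC", "AT", "AC", "AT", "GC", "GT", "GC", "GT"]
high_ld = ["AC", "AC", "AC", "AC", "GT", "GT", "GT", "GT"]
print(r_squared(low_ld), r_squared(high_ld))
```

In the low-LD set the allele at one site says nothing about the allele at the other, while in the high-LD set it determines it completely.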

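Measuring ‘congruence’ To measure the level of ‘congruence’ in SNP data from larger regions we define a score function $S^* = \max_{J} S(J)$, with the maximum taken over ordered subsets $J = (i_1, \dots, i_k)$ of the SNPs, where $S(i_1, \dots, i_k) = \sum_{j=1}^{k-1} S(i_j, i_{j+1})$ and $S(i_j, i_{j+1})$ is a function of both congruence (or near congruence) and physical distance between $i_j$ and $i_{j+1}$.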
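
Reading S* as the maximum, over ordered subsets of SNPs, of the summed scores of consecutive pairs, the maximisation can be done by dynamic programming over the SNP that ends the subset. The sketch below is an illustration only: `s_star` and `toy_pair_score` are hypothetical names, and the toy pair score (a bonus that grows with distance for identical patterns, a flat penalty otherwise) is a placeholder rather than the score function used in the study.

```python
def s_star(positions, patterns, pair_score):
    """Maximum over ordered SNP subsets of the sum of consecutive pair scores."""
    n = len(positions)
    # best_end[j]: best score of any subset whose last SNP is j.
    # A one-SNP subset has an empty sum, so every entry starts at 0.
    best_end = [0.0] * n
    for j in range(n):
        for i in range(j):
            cand = best_end[i] + pair_score(
                positions[i], patterns[i], positions[j], patterns[j]
            )
            if cand > best_end[j]:
                best_end[j] = cand
    return max(best_end)


def toy_pair_score(pos_i, pat_i, pos_j, pat_j):
    # Placeholder scoring: reward congruent pairs, more so when far apart.
    if pat_i == pat_j:
        return 5000 + (pos_j - pos_i)
    return -10000


# Toy input: four SNPs; the second has a different pattern from the others.
positions = [100, 900, 1500, 4000]
patterns = ["a", "b", "a", "a"]
print(s_star(positions, patterns, toy_pair_score))
```

The double loop costs time quadratic in the number of SNPs, whereas enumerating every subset would be exponential.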

An example (CHRNA4) How often is S* from simulations greater than or equal to the S* value from the actual data? p = 0.025

S* is sensitive to ancient admixture

General approach We use the model parameters estimated before (growth rates, migration rate, split time) as a demographic null model. Is our null model sufficient to explain the patterns of LD in the data? We test this by comparing the observed S* values with the distribution of S* values calculated from data simulated under the null model.
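
The comparison described above amounts to an empirical p-value: the fraction of null simulations whose S* is at least as large as the observed one. A minimal sketch follows; `empirical_p_value` is a hypothetical helper and the numbers are made up, standing in for S* values from data simulated under the null model.

```python
def empirical_p_value(observed, simulated):
    """Fraction of simulated statistics that are >= the observed statistic."""
    return sum(s >= observed for s in simulated) / len(simulated)


# Made-up stand-ins for S* values simulated under the null model.
simulated_s_star = [12000, 15500, 9800, 21000, 14300, 17800, 11000, 16400]
observed_s_star = 17800
print(empirical_p_value(observed_s_star, simulated_s_star))
```

The resolution of such a p-value is limited by the number of simulations: with R simulated values the smallest nonzero value it can take is 1/R.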

Distribution of p-values (Mandenka and San) [Figure: histogram of p-values; axes: p-value, frequency.] Global p-value: 2.5 × 10^-5

Estimating ancient admixture rates The global p-values for S* are highly significant in every population that we’ve studied! If we estimate the ancient admixture rate in our composite-likelihood framework, we can reject the hypothesis of no ancient admixture for all populations studied.

A region on chromosome 4

19 mutations (from 6 Kb of sequence) separate 3 Biaka sequences from all of the other sequences in our sample. Simulations suggest this cannot be caused by recent population structure (p < ). This corresponds to isolation lasting ~1.5 million years!

Possible explanations Isolation followed by later mixing is a recurrent feature of human population history. Mixing between ‘archaic’ humans and modern humans happened at least once prior to the exodus of modern humans out of Africa. Some other feature of population structure is unaccounted for in our simple models.

Acknowledgments Collaborators: Mike Hammer (U. of Arizona), Vincent Plagnol (Cambridge University). Samples: Foundation Jean Dausset (CEPH), Y chromosome consortium (YCC). Funding: National Science Foundation, National Institutes of Health.