The Coalescent & Human Sequence Variation (11.6.02) I. The Human Population & its Genome. The Existing Data: SNPs & Haplotypes. Reconstructing Haplotypes.


The Coalescent & Human Sequence Variation (11.6.02). I. The Human Population & its Genome. The Existing Data: SNPs & Haplotypes. Reconstructing Haplotypes. II. The Coalescent with Mutations & Ancestral Analysis. III. "The Story" of Human Evolution. IV. "The Story" of the Coalescent.

The Human Genome X Y billion base pairs per haploid genome genes

Recent SNP/Haplotype Analysis. Inter. SNP Consortium (2001): A map of human genome sequence variation containing 1.42 million SNPs. Nature. Two complete haploid genomes are expected to differ at about 3 million SNPs. As more genomes are added, the expected number of SNPs should grow like the expected number of segregating sites in an ideal population, i.e. approximately logarithmically in the number of genomes.
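
The logarithmic growth can be made concrete with a small sketch. It assumes the standard neutral, constant-size model with infinite sites, where the expected number of segregating sites in n sequences is theta times the harmonic sum 1 + 1/2 + ... + 1/(n-1). The function name and the choice theta = 3 million (the slide's figure for two genomes) are illustrative assumptions, not part of the original slides.

```python
def expected_segregating_sites(n, theta):
    """Expected number of segregating sites among n sequences
    under the standard neutral model: theta * sum_{i=1}^{n-1} 1/i."""
    return theta * sum(1.0 / i for i in range(1, n))


# Assumption: theta is set so that two genomes differ at about 3 million sites.
theta_genome = 3e6
for n in (2, 10, 100, 1000):
    print(n, round(expected_segregating_sites(n, theta_genome)))
```

Going from 100 to 1000 genomes adds fewer new SNPs than going from 10 to 100 adds in relative terms, which is the "approximately logarithmic" growth.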

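Linkage disequilibrium in the human genome. Reich, DE et al. (2001) Linkage disequilibrium in the human genome. Nature. Definition: $LD := f_{i,j} - f_i f_j$, where $f_i$ and $f_j$ are the frequencies of an allele at each of two sites and $f_{i,j}$ is the frequency of haplotypes carrying both. Expectation: $E(LD) = 1/(1 + 4 N_e r)$, with $N_e$ the effective population size and $r$ the recombination rate between the sites. LD in Europeans stretches about 60 kb, while in Yorubans it extends over much shorter distances.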
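
A minimal sketch of the two quantities on this slide. The function names, the string representation of haplotypes and the numerical values of N_e and r are assumptions made for illustration; the decay formula is the one quoted on the slide.

```python
def ld_coefficient(haplotypes, i, j, allele_i, allele_j):
    """LD = f_ij - f_i * f_j for two sites i, j in a list of haplotype strings."""
    n = len(haplotypes)
    f_i = sum(h[i] == allele_i for h in haplotypes) / n
    f_j = sum(h[j] == allele_j for h in haplotypes) / n
    f_ij = sum(h[i] == allele_i and h[j] == allele_j for h in haplotypes) / n
    return f_ij - f_i * f_j


def expected_ld(n_e, r):
    """The slide's approximation 1 / (1 + 4 * N_e * r)."""
    return 1.0 / (1.0 + 4.0 * n_e * r)


# Two sites in complete association: A always travels with C.
sample = ["AC", "AC", "TG", "TG"]
print(ld_coefficient(sample, 0, 1, "A", "C"))
# Hypothetical values: N_e = 10000 and r growing with distance.
for r in (0.0, 1e-5, 1e-4, 1e-3):
    print(r, expected_ld(10_000, r))
```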

Daly, JM et al. (2001) High-resolution haplotype structure in the human genome. Nat. Gen. Conclusions (258 chromosomes): Haplotypes are better than SNPs. The genome is split up into blocks without recombination.

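SNPs & haplotypes: Getting Haplotypes. Experimental routes: egg & sperm sequencing; cell lines with lost chromosomes; sequencing clones spanning SNPs. These are very expensive, so reconstructing haplotypes from SNPs is favoured. Example: the haplotypes AGC and TCA give the unphased SNP genotype {A,T}{C,G}{A,C}. A genotype with m heterozygous sites is compatible with $2^{m-1}$ haplotype pairs.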
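
The count of compatible haplotype pairs can be checked by enumeration. This is an illustrative sketch; the function name and the representation of a genotype as a list of allele sets are assumptions.

```python
from itertools import product


def haplotype_pairs(genotype):
    """Enumerate the unordered haplotype pairs compatible with an unphased genotype.

    genotype: list of sets, one per site, holding 1 allele (homozygous)
    or 2 alleles (heterozygous).
    """
    sites = [sorted(s) for s in genotype]
    het = [k for k, s in enumerate(sites) if len(s) == 2]
    pairs = []
    # Fix the phase of the first heterozygous site so that (h1, h2) and
    # (h2, h1) are not counted twice; the other m-1 sites are free.
    for bits in product((0, 1), repeat=max(len(het) - 1, 0)):
        phase = dict(zip(het, (0,) + bits))
        h1 = "".join(s[phase.get(k, 0)] for k, s in enumerate(sites))
        h2 = "".join(s[1 - phase[k]] if k in phase else s[0]
                     for k, s in enumerate(sites))
        pairs.append((h1, h2))
    return pairs


for pair in haplotype_pairs([{"A", "T"}, {"C", "G"}, {"A", "C"}]):
    print(pair)
```

With three heterozygous sites the loop prints four pairs, one of which is the pair (AGC, TCA) from the slide.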

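SNPs --> haplotypes: Computational Problem. Input: an n x m matrix of unphased genotypes, with entry $\{N_1, N_2\}_{i,j}$ for individual i = 1, ..., n at site j = 1, ..., m. Output: two haplotype rows per individual, each with entry $N_1$ or $N_2$ at every site, i.e. a 2n x m haplotype matrix consistent with the genotypes.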

SNPs ---> haplotypes: Clark (1990). Algorithm: 1. Find individuals that are homozygous at all sites or heterozygous at a single site, and deduce the haplotypes they carry. 2. Run through the remaining genotypes and resolve each against the expanding set of determined haplotypes. 3. Check whether unresolved haplotypes can be explained as recombinations of resolved haplotypes.
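
The first two steps can be sketched as follows. This is a simplified illustration, not Clark's original implementation: genotypes are assumed to be strings over 0/1 (homozygous) and 2 (heterozygous), the recombination check of the third step is left out, and all names are hypothetical. The result depends on the order in which genotypes and known haplotypes are visited.

```python
def compatible(hap, geno):
    """A haplotype fits a genotype if it matches every homozygous site."""
    return all(g == "2" or g == h for h, g in zip(hap, geno))


def complement(hap, geno):
    """The second haplotype implied by a genotype and one of its haplotypes."""
    return "".join(g if g != "2" else ("1" if h == "0" else "0")
                   for h, g in zip(hap, geno))


def clark(genotypes):
    known, resolved = [], {}
    # Step 1: homozygotes and single heterozygotes are unambiguous.
    for idx, g in enumerate(genotypes):
        if g.count("2") <= 1:
            pair = (g.replace("2", "0"), g.replace("2", "1"))
            resolved[idx] = pair
            for h in pair:
                if h not in known:
                    known.append(h)
    # Step 2: resolve remaining genotypes against the growing haplotype set.
    progress = True
    while progress:
        progress = False
        for idx, g in enumerate(genotypes):
            if idx in resolved:
                continue
            for h in known:
                if compatible(h, g):
                    other = complement(h, g)
                    resolved[idx] = (h, other)
                    if other not in known:
                        known.append(other)
                    progress = True
                    break
    unresolved = [i for i in range(len(genotypes)) if i not in resolved]
    return resolved, unresolved


print(clark(["000", "021", "222"]))
print(clark(["22", "22"]))  # no starting point: everything stays unresolved
```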

SNPs ---> haplotypes: Clark (1990). Three problems: 1. No homozygotes or single heterozygotes may be available. 2. The process may leave unresolved haplotypes. 3. A haplotype may be declared a recombination between two existing haplotypes although it exists in the sample. The method assigns haplotypes on a spanning tree instead of a phylogenetic tree. [Figure: a spanning tree on haplotypes H1-H6 compared with a phylogenetic tree with leaves H1-H4.]

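SNPs ---> haplotypes: Gusfield (2002). Haplotype Inference: make a phylogeny with 2n leaves that explains all the SNPs. Perfect Phylogeny: only 0 or 1 mutation event at each site. A position in an individual is labelled 0 or 1 if it is homozygous for one of the two variants, and 2 if it is heterozygous.

Genotype matrix S (individuals a, b, c; sites 1, 2):
      1  2
  a   2  2
  b   0  2
  c   1  0

Each row of S is duplicated (a, a'; b, b'; c, c') and the 2s are resolved, giving the haplotype matrix B:
      1  2
  a   1  0
  a'  0  1
  b   0  1
  b'  0  0
  c   1  0
  c'  1  0

[Figure: the perfect phylogeny T(S) with leaves a, a', b, b', c, c' and edges labelled by sites 1 and 2.]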
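
A brute-force sketch of perfect phylogeny haplotyping for small inputs. It is not Gusfield's near-linear method: it tries every phasing and, as an assumption about the criterion, accepts a set of binary haplotypes when no pair of sites shows all four gametes 00, 01, 10, 11 (the standard test for a perfect phylogeny without a fixed root). Function names are hypothetical, and the running time is exponential in the number of heterozygous sites.

```python
from itertools import combinations, product


def admits_perfect_phylogeny(haps):
    """Four-gamete test on a list of 0/1 tuples."""
    m = len(haps[0])
    for i, j in combinations(range(m), 2):
        if len({(h[i], h[j]) for h in haps}) == 4:
            return False
    return True


def phasings(genotype):
    """All unordered haplotype pairs for one genotype over {0, 1, 2}."""
    het = [k for k, g in enumerate(genotype) if g == 2]
    out = []
    for bits in product((0, 1), repeat=max(len(het) - 1, 0)):
        phase = dict(zip(het, (0,) + bits))
        h1 = tuple(phase[k] if k in phase else g for k, g in enumerate(genotype))
        h2 = tuple(1 - phase[k] if k in phase else g for k, g in enumerate(genotype))
        out.append((h1, h2))
    return out


def pph_solutions(genotypes):
    """All joint phasings whose 2n haplotypes pass the four-gamete test."""
    solutions = []
    for choice in product(*(phasings(g) for g in genotypes)):
        haps = [h for pair in choice for h in pair]
        if admits_perfect_phylogeny(haps):
            solutions.append(choice)
    return solutions


S = [(2, 2), (0, 2), (1, 0)]          # rows a, b, c from the slide
print(pph_solutions(S))
print(pph_solutions([(0, 0), (0, 1), (1, 0), (1, 1)]))  # no solution exists
```

For the slide's matrix S the search returns a single solution, the matrix B (up to the order of the two haplotypes within an individual). The second call shows a data set with no perfect phylogeny.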

SNPs ---> haplotypes: Gusfield (2002). PPH (perfect phylogeny haplotyping) can be reduced to the graph realization problem, i.e. recognizing graphic binary matroids. This problem has an almost linear-time algorithm (that has never been implemented). The reduction also allows efficient enumeration of the possible solutions. Question: are there SNP data sets that do not admit a perfect phylogeny?

SNPs ---> haplotypes: Stephens (2001). $G = (G_1, \ldots, G_n)$: SNP-types. $H = (H_1, \ldots, H_n)$: haplotypes. $F = (F_1, \ldots, F_m)$: population haplotype frequencies. $f = (f_1, \ldots, f_m)$: sample haplotype frequencies. i. Find F that maximizes the probability of the observed sample. ii. The same for population parameters. iii. Simulation is very easy. [Figure: haplotypes H1-H4.]

The Exponential Distribution. Exp(a) is a distribution on $R^+$. Density: $f(t) = a e^{-at}$; tail: $P(X > t) = e^{-at}$. Properties, for independent $X \sim Exp(a)$ and $Y \sim Exp(b)$: i. $P(X > t_2 \mid X > t_1) = P(X > t_2 - t_1)$ for $t_2 > t_1$. ii. $E(X) = 1/a$. iii. $P(X < Y) = a/(a + b)$. iv. $\min(X, Y) \sim Exp(a + b)$. v. The sum of k iid $X_i$ is $\Gamma(k, a)$ distributed.
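
Properties iii and iv are the ones the coalescent relies on (competing events), and they are easy to check by simulation. The sketch below is illustrative; the function name, trial count and seed are arbitrary choices.

```python
import random


def check_exponential_properties(a, b, trials=200_000, seed=1):
    """Estimate P(X < Y) and E[min(X, Y)] for independent X ~ Exp(a), Y ~ Exp(b)."""
    rng = random.Random(seed)
    x_first = 0
    total_min = 0.0
    for _ in range(trials):
        x = rng.expovariate(a)
        y = rng.expovariate(b)
        x_first += x < y
        total_min += min(x, y)
    return x_first / trials, total_min / trials


p, mean_min = check_exponential_properties(1.0, 3.0)
print(p, "should be close to a/(a+b) =", 1.0 / 4.0)
print(mean_min, "should be close to 1/(a+b) =", 1.0 / 4.0)
```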

The Standard Coalescent. Two independent processes. Continuous: exponential waiting times. Discrete: choosing pairs to coalesce. [Figure: a genealogy of five sequences, (1,2)--(3,(4,5)), with waiting and coalescing steps and the partitions {1}{2}{3}{4}{5}, {1,2}{3}{4,5}, {1}{2}{3,4,5}, {1,2}{3,4,5}, {1,2,3,4,5}.]
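
The two processes translate directly into a simulation: with k lineages, wait an Exp(k(k-1)/2) time, then merge a uniformly chosen pair. This is a minimal sketch with hypothetical names; time is in coalescent units.

```python
import random


def simulate_coalescent(n, seed=None):
    """Return the list of (time, merged block) events for a sample of size n."""
    rng = random.Random(seed)
    lineages = [frozenset([i]) for i in range(1, n + 1)]
    t = 0.0
    events = []
    while len(lineages) > 1:
        k = len(lineages)
        # Continuous part: exponential waiting time with rate k choose 2.
        t += rng.expovariate(k * (k - 1) / 2)
        # Discrete part: pick the pair to coalesce uniformly at random.
        i, j = rng.sample(range(k), 2)
        merged = lineages[i] | lineages[j]
        lineages = [b for idx, b in enumerate(lineages) if idx not in (i, j)]
        lineages.append(merged)
        events.append((t, merged))
    return events


for time, block in simulate_coalescent(5, seed=0):
    print(round(time, 3), sorted(block))
```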

Additional Evolutionary Factors. Geographical structure: admixture can create longer LD islands. Population growth: present-day LD in a large population can have the characteristics of a small population. Recombination/gene conversion: gene conversion can make LD fall off at short distances by more than LD at longer distances would suggest. Selection: selective sweeps can create strong LD locally.

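Two sequences, infinite sites & k differences. Consider the probability that there are k differences between two sequences. Going back in time, two kinds of events can occur: a mutation (rate $\theta$) or a coalescence (rate 1). The number of mutations before the coalescence is therefore geometric: $P(k) = \left(\frac{\theta}{1+\theta}\right)^k \frac{1}{1+\theta}$. [Figure: the two lineages with mutations marked by *, and waiting times Exp(1) to coalescence and Exp($\theta$) to mutation.] Given k differences, the waiting time to the j'th newest mutation is $\Gamma(j, 1+\theta)$ distributed and the TMRCA is $\Gamma(k+1, 1+\theta)$ distributed, so that $E_k(TMRCA) = (k+1)/(1+\theta)$.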
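
A short sketch of the competing-events argument, assuming the scaling used on the slide (coalescence at rate 1, mutation on the pair at total rate theta). Function names are hypothetical. Events arrive at rate 1 + theta and each is a mutation with probability theta/(1 + theta).

```python
import random


def prob_k_differences(k, theta):
    """Geometric probability of k mutations before the pair coalesces."""
    return (theta / (1 + theta)) ** k / (1 + theta)


def simulate_pair(theta, rng):
    """Return (number of differences, time to the MRCA) for one pair."""
    k, t = 0, 0.0
    while True:
        t += rng.expovariate(1 + theta)          # time to the next event
        if rng.random() < theta / (1 + theta):   # the event is a mutation
            k += 1
        else:                                    # the event is the coalescence
            return k, t


rng = random.Random(0)
draws = [simulate_pair(2.0, rng) for _ in range(100_000)]
for k in range(4):
    observed = sum(1 for d, _ in draws if d == k) / len(draws)
    print(k, round(observed, 4), round(prob_k_differences(k, 2.0), 4))
```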

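n sequences, infinite sites & k differences. Russell Thompson. [Figure: genealogy of n sequences with s segregating sites; the oldest mutation is marked.] While there are k lineages, the waiting time to coalescence is Exp(k(k-1)/2) and the waiting time to mutation is Exp(k$\theta$/2), which reduces to the two-sequence rates 1 and $\theta$ at k = 2. Only the number of segregating sites is observed. Explicit expressions or simple recursions exist for distributions analogous to the two-sequence case.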

Classical Polya Urns. Feller I. 213. Let $X_0$ be the initial configuration of the urn. A step: take a random ball from the urn and put it back together with an extra ball of the same colour. Let $X_k$ be the content after the k'th step and $Y_k$ the colour of the k'th picked ball. i. $P\{Y_k = j\} = P\{Y_1 = j\}$. ii. Sequences $Y_1, \ldots, Y_k$ resulting in the same $X_k$ have the same probability.
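
Property i can be verified exactly for small urns by tracking the distribution over urn contents with rational arithmetic. The function name and the example urn (2 balls of one colour, 3 of another) are illustrative assumptions.

```python
from fractions import Fraction


def colour_probabilities(initial, k):
    """Exact P(Y_k = j) for each colour j, starting from the counts in `initial`."""
    states = {tuple(initial): Fraction(1)}
    # Evolve the distribution of urn contents through the first k-1 draws.
    for _ in range(k - 1):
        nxt = {}
        for counts, p in states.items():
            total = sum(counts)
            for j, c in enumerate(counts):
                if c == 0:
                    continue
                new = list(counts)
                new[j] += 1          # the drawn ball returns with a copy
                key = tuple(new)
                nxt[key] = nxt.get(key, 0) + p * Fraction(c, total)
        states = nxt
    # Probability of each colour at the k'th draw.
    probs = [Fraction(0)] * len(initial)
    for counts, p in states.items():
        total = sum(counts)
        for j, c in enumerate(counts):
            probs[j] += p * Fraction(c, total)
    return probs


print(colour_probabilities((2, 3), 1))
print(colour_probabilities((2, 3), 6))
```

Both calls give the same probabilities, 2/5 and 3/5, as property i states.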

Labelling, Polya Urns & Age of Alleles (Donnelly, Hoppe). An urn starts with a single ball of weight $\theta$. A ball is picked with probability proportional to its weight. Ordinary balls have weight 1. If the initial ball of weight $\theta$ is picked, it is replaced together with a ball of a completely new type. If an ordinary ball is picked, it is replaced together with a copy of itself. There is a simple relationship: the distribution of "the alleles labelled by age ranking" is the same as that of "the alleles labelled by size ranking". [Figure: the alleles listed as they come, by size, and by age.]
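
The urn scheme described above can be sketched as a simulation. Names and parameter values are hypothetical. Alleles are stored in order of appearance, so the first entry is the oldest type; the expected number of types after n draws, the sum of theta/(theta + i) for i = 0, ..., n-1, follows directly from the picking rule.

```python
import random


def hoppe_urn(n, theta, seed=None):
    """Run the urn for n draws; return allele counts, oldest allele first."""
    rng = random.Random(seed)
    counts = []
    for i in range(n):                       # i ordinary balls are in the urn
        if rng.random() < theta / (theta + i):
            counts.append(1)                 # the theta-ball: a new type appears
        else:
            r = rng.randrange(i)             # an ordinary ball, chosen uniformly
            acc = 0
            for a, c in enumerate(counts):
                acc += c
                if r < acc:
                    counts[a] += 1           # it returns with a copy of itself
                    break
    return counts


def expected_number_of_alleles(n, theta):
    return sum(theta / (theta + i) for i in range(n))


sample = hoppe_urn(50, 1.5, seed=3)
print("by age (oldest first):", sample)
print("by size:", sorted(sample, reverse=True))
print("expected number of alleles:", round(expected_number_of_alleles(50, 1.5), 2))
```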

n sequences, infinite sites & 1 segregating site (d, n-d). M. Stephens 2000; Griffiths & Tavaré. [Figure: a genealogy in which a single mutation separates sequences 1, ..., d from sequences d+1, ..., n.] Distribution of the age of the mutation: [...]. Population analogues can be obtained by letting n → 2N.

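Shape of Tree Hanging below a Mutation. Griffiths & Tavaré. [Figure: a genealogy in which a single mutation separates sequences 1, ..., d from sequences d+1, ..., n, with the level of k lineages marked.] Quantity of interest: the probability that a specified edge, at the time when there were k lineages, has b descendants. It is possible to describe the shape of the hanging tree.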

Mutations & their Branch. Wiuf & Donnelly (1999), Wiuf (2001a,b). Exact expressions can be obtained for the start and end of the mutation branch and for the position of the mutation. There are approximations for a small (< 10%) mutation tree that also allow the mutation to have a selection coefficient. [Figure: f = frequency of mutant, n = 1000.]

Mutations & their Branch. Wiuf (2001a,b) The Effect of Selection & Growth.

Cystic Fibrosis (Wiuf 2001). ΔF508: possibly maintained by heterosis (1.023), through higher resistance to Salmonella infections. Data: frequency of the ΔF508 allele; inter variability in individuals; 46 variable positions. Model of human demography. Model parameters: mutation rate, heterosis advantage and an exponential growth model of human population expansion. [Figure: genealogy with the mutation marked by *.] The age of ΔF508 is estimated to be: [...]

Human History - Two Levels: Physical & Genealogical. The physical population size, N(t), and the effective population size, $N_e(t)$, are separate concepts. i. N(t) can mainly be studied by historical/archaeological means. ii. $N_e(t)$ can be studied genealogically, for instance by tracing the ancestries of DNA sequences. Main departures from the simplest population genetic models: A. Long epochs of exponential growth at increasing rates. B. Bottlenecks. C. Migrations & geographical subdivisions.

Out of Africa / Multiregional Model. A 1st origin of humans in Africa 3-5 Myr ago is relatively well accepted. A 2nd origin from Africa [...] Kyr ago is controversial. 1. Was there a population expansion from Africa that replaced the populations in Asia/Europe that had left fossils, as asserted by the Out of Africa Model? 2. Or did this expansion hybridize with the local populations, as asserted by the Multiregional Model?

From Templeton,2002 March Nature

From Cavalli-Sforza, 2001: Human Migrations

Cavalli-Sforza: Language Trees. Cavalli-Sforza (1997) Genes, Peoples and Languages. PNAS. Principles of comparison: loss of cognates ("homologous" words); syntax comparison; sound use. Reconstruction depends on interpretation and stretches back [...] years, depending on criteria.

Cavalli-Sforza: Principal Components: Agriculture, ... Agriculture (6-10 Kyr). Greek colonisation (3 Kyr). Retraction of the Basques. Uralic people. Horse domestication.

Homo sapiens & the Neanderthal (Nordborg). Two scenarios: a constant female population size, or a population growing for [...] years to $5 \times 10^8$. Problem: can the observations be explained by one common H. sapiens-Neanderthal population? [Figure: genealogy of 986 H. sapiens sequences and one Neanderthal sequence, with times $T_e$, $T_t$ and $t_s$.] [Table: columns "Constant pop. size" and "Recent growth"; rows E(A()), P(topology), P(topology & $T_t > 4T_e$).]

Summary The Existing Data: SNPs & Haplotypes. Reconstructing Haplotypes. The Coalescent with Mutations. The Human Population, its history & its Genome. A serious gap between capabilities of theory and the demand of existing data.

History of Coalescent Theory
[...]s: Genealogical arguments well known to Wright & Fisher.
1964: Crow & Kimura: infinite allele model.
1966: Hubby & Lewontin, and Harris, make the first surveys of population allele variation by protein electrophoresis.
1968: Motoo Kimura proposes a neutral explanation of molecular evolution & population variation. So do King & Jukes.
1971: Kimura & Ohta propose the infinite sites model.
1975: Watterson makes explicit use of "The Coalescent".
1982: Kingman introduces "The Coalescent".
1983: Hudson introduces "The Coalescent with Recombination".
1983: Kreitman publishes the first major population sequences.
1987: Cann et al. try to trace human origins and migrations with mitochondrial DNA.

[...]: Kaplan, Hudson, Takahata and others: selection regimes with coalescent structure (MHC, incompatibility alleles).
1988: Hughes & Nei: genes with positive Darwinian selection.
[...]: Griffiths, Ethier & Tavaré calculate the infinite-sites data probability.
1991: McDonald & Kreitman: data with a surplus of replacement interspecific substitutions.
1991: Aquadro & Begun: positive correlation between recombination and nucleotide variation.
[...]: Griffiths-Tavaré and Kuhner-Yamato-Felsenstein introduce highly computer-intensive simulation techniques to estimate parameters in population models.
[...]: Krone-Neuhauser introduce selection in the Coalescent.
[...]: Donnelly, Stephens, Fearnhead et al.: major accelerations in coalescent-based data analysis.
1999: Wiuf & Donnelly use Coalescent Theory to estimate the age of a disease allele.
2000: Wiuf et al. introduce gene conversion into the coalescent.
[...]: Several groups combine Coalescent Theory & gene mapping. A flood of SNP data & haplotypes is on its way.

Recommended Literature & www-sites
Cavalli-Sforza (2001) Genes, Peoples and Languages. Penguin.
Clark, A. (1990) "Inference of Haplotypes from PCR-amplified Samples of Diploid Populations" Mol.Biol.Evol.
Daly, JM et al. (2001) "High-resolution haplotype structure in the human genome" Nat.Gen.
Donnelly, P. and R. Foley (eds) (2001) Genes, Fossils and Behaviour. IOS Press.
Goldstein, DB & Chikhi (2002) "Human Migrations and Population Structure" Annu.Rev.Genomics Hum.Genetics (forthcoming)
Griffiths, RC "Ancestral Inference from Gene Trees" in Donnelly, P. and R. Foley (eds) (2001) Genes, Fossils and Behaviour. IOS Press.
Gusfield (2002) "Haplotypes as perfect phylogeny" To appear in Recomb2002.
Hoppe (1984) "Polya-like urns and the Ewens' sampling formula" J.Math.Biol.
Harpending & Rogers (2000) "Genetic Perspectives on Human Origins and Differentiation" Annu.Rev.Genom.Hum.Genet.
Inter. SNP Consortium (2001) "A map of human genome sequence variation containing 1.42 million SNPs" Nature.
Nichols, J. (1997) "Modelling Ancient Population Structures and Movement in Linguistics" Annu.Rev.Anthrop.
Reich, DE et al. (2001) "Linkage disequilibrium in the human genome" Nature.
Relethford (2001) Genetics and the Search for Modern Human Origins. Wiley.
Slatkin & Rannala (2000) "Estimating Allele Age" Annu.Rev.Genomics Hum.Genet.
Stephens, M. (1999) "Times on Trees, and the Age of an Allele" Theor.Pop.Biol.
Stephens, M. et al. (2001) "A New Statistical Method for Haplotype Reconstruction from Population Data" Am.J.Hum.Gen.
Templeton, A. (2002) "Out of Africa again and again" Nature.
Thompson, R. (1998) "Ages of mutations on a coalescent tree" Math.Bios.
Wiuf & Hein (1997) "The Number of Ancestors to DNA Sequence" Genetics.
Wiuf (2000) "On the Genealogy of a Sample of Neutral Rare Alleles" Theor.Pop.Biol.
Wiuf (2001) "Rare Alleles and Selection" Theor.Pop.Biol.
Wiuf (2001) "Do DF508 heterozygotes have a selective advantage?" Genet.Res.Cam.
Wiuf & Donnelly (1999) "Conditional Genealogies and the Age of a Mutant" Theor.Pop.Biol.
www-sites: Mikkel Schierup's program package. Gil McVean's course in population genetics.