1 Population Genetics Basics. 2 Terminology review Allele Locus Diploid SNP.

Presentation transcript:

1 Population Genetics Basics

2 Terminology review: allele, locus, diploid, SNP

3 Single Nucleotide Polymorphisms. Infinite Sites Assumption: each site mutates at most once.

4 What causes variation in a population? Mutations (may lead to SNPs); recombinations; other genetic events (gene conversion); structural polymorphisms.

5 Recombination

6 Gene Conversion Gene Conversion versus crossover – Hard to distinguish in a population

7 Structural polymorphisms Large scale structural changes (deletions/insertions/inversions) may occur in a population.

8 Topic 1: Basic Principles. In a ‘stable’ population, the distribution of alleles obeys certain laws. – Not really, and the deviations are interesting. HW equilibrium – due to mixing in a population. Linkage (dis)equilibrium – due to recombination.

9 Hardy Weinberg equilibrium. Consider a locus with 2 alleles, A and a. p (respectively, q) is the frequency of A (resp. a) in the population. 3 genotypes: AA, Aa, aa. Q: What is the frequency of each genotype? If various assumptions are satisfied (such as random mating and no natural selection), then $P_{AA} = p^2$, $P_{Aa} = 2pq$, $P_{aa} = q^2$.

10 Hardy Weinberg: why? Assumptions: – Diploid – Sexual reproduction – Random mating – Bi-allelic sites – Large population size, … Why? Each individual randomly picks its two chromosomes from the population. Therefore, Prob.(Aa) = pq + qp = 2pq, and so on.

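11 Hardy Weinberg: Generalizations. Multiple alleles with frequencies $p_1, \ldots, p_k$ – By HW, $\Pr[\text{homozygous } i] = p_i^2$ and $\Pr[\text{heterozygous } i,j] = 2 p_i p_j$. Multiple loci?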

12 Hardy Weinberg: Implications. The allele frequency does not change from generation to generation. Why? It is observed that 1 in 10,000 Caucasians has the disease phenylketonuria. The disease mutation(s) are all recessive. What fraction of the population carries the mutation? Males are 100 times more likely to have the ‘red’ type of color blindness than females. Why? Conclusion: While the HW assumptions are rarely satisfied, the principle is still important as a baseline assumption, and significant deviations are interesting.
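The two questions on this slide can be worked through numerically. The sketch below assumes HWE holds; for color blindness it further assumes an X-linked recessive allele with an illustrative frequency of 0.01, which is not a value from the slides. The function names are hypothetical.

```python
import math

def carrier_fraction(disease_prevalence):
    """Fraction of heterozygous carriers (Aa) of a recessive disease under HWE."""
    q = math.sqrt(disease_prevalence)  # affected individuals are aa, so P_aa = q^2
    p = 1.0 - q
    return 2.0 * p * q                 # P_Aa = 2pq

def male_to_female_ratio(q):
    """X-linked recessive trait: males need one copy (q), females need two (q^2)."""
    return q / (q * q)

pku_carriers = carrier_fraction(1 / 10_000)
print(f"PKU carriers: {pku_carriers:.4f}")
print(f"male:female ratio at q = 0.01: {male_to_female_ratio(0.01):.0f}")
```

Under these assumptions about 2% of the population carries a PKU mutation, even though only 0.01% are affected.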

13 Recombination

14 What if there were no recombinations? Life would be simpler. Each individual sequence would have a single parent (even for higher ploidy). The relationship is expressed as a tree.

15 The Infinite Sites Assumption. The different sites are linked. A 1 in position 8 implies 0 in position 5, and vice versa. Some phenotypes could be linked to the polymorphisms. Some of the linkage is “destroyed” by recombination.

16 Infinite sites assumption and Perfect Phylogeny. Each site is mutated at most once in the history. All descendants must carry the mutated value, and all others must carry the ancestral value. (Figure: a tree edge labeled i; 1 in position i below the edge, 0 in position i elsewhere.)

17 Perfect Phylogeny Assume an evolutionary model in which no recombination takes place, only mutation. The evolutionary history is explained by a tree in which every mutation is on an edge of the tree. All the species in one sub-tree contain a 0, and all species in the other contain a 1. Such a tree is called a perfect phylogeny.

18 The 4-gamete condition. A column i partitions the set of species into two sets, $i_0$ and $i_1$. A column is homogeneous w.r.t. a set of species if it has the same value for all species in the set; otherwise, it is heterogeneous. Ex: with column i given by A 0, B 0, C 0, D 1, E 1, F 1, we have $i_0 = \{A,B,C\}$ and $i_1 = \{D,E,F\}$, and i is heterogeneous w.r.t. {A,D,E}.

19 4 Gamete Condition – There exists a perfect phylogeny if and only if for all pairs of columns (i,j), j is homogeneous w.r.t. $i_0$ or w.r.t. $i_1$. – Equivalent to – There exists a perfect phylogeny if and only if for all pairs of columns (i,j), the 4 rows (0,0), (0,1), (1,0), (1,1) do not all occur.
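The second form of the condition translates directly into a pairwise check. This is a minimal sketch, assuming the SNP matrix is a list of equal-length rows of 0/1 values; the function names are hypothetical.

```python
from itertools import combinations

def violates_four_gamete(matrix):
    """Return the first column pair (i, j) in which all four gametes occur, else None."""
    n_cols = len(matrix[0])
    for i, j in combinations(range(n_cols), 2):
        gametes = {(row[i], row[j]) for row in matrix}  # distinct (value_i, value_j) rows
        if len(gametes) == 4:
            return (i, j)
    return None

def has_perfect_phylogeny(matrix):
    """True when no pair of columns shows (0,0), (0,1), (1,0) and (1,1) together."""
    return violates_four_gamete(matrix) is None

compatible = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 0, 0]]
conflicting = [[0, 0], [0, 1], [1, 0], [1, 1]]
print(has_perfect_phylogeny(compatible), violates_four_gamete(conflicting))
```

The check examines every pair of columns, so it costs time proportional to (number of columns)² × (number of rows); the construction algorithm on the following slides is faster.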

20 4-gamete condition: proof. Depending on which edge the mutation j occurs, column j is homogeneous w.r.t. either $i_0$ or $i_1$. (only if) Every perfect phylogeny satisfies the 4-gamete condition. (if) If the 4-gamete condition is satisfied, does a perfect phylogeny exist?

21 An algorithm for constructing a perfect phylogeny. We will consider the case where 0 is the ancestral state, and 1 is the mutated state. This restriction is lifted later (unrooted case). In any tree, each node (except the root) has a single parent. – It is sufficient to construct a parent for every node. In each step, we add a column and refine some of the nodes containing multiple children. Stop when all columns have been considered.

22 Inclusion Property. For any pair of columns i,j – i < j if and only if $i_1 \supseteq j_1$. Note that if i < j then the edge containing i is an ancestor of the edge containing j.

23 Example. (Figure: species A, B, C, D, E attached to a single root r.) Initially, there is a single clade r, and each node has r as its parent.

24 Sort columns. Sort columns according to the inclusion property (note that the columns are already sorted here). This can be achieved by considering the columns as binary representations of numbers (most significant bit in row 1) and sorting in decreasing order. (Figure: matrix with rows A, B, C, D, E.)

25 Add first column. In adding column i – Check each edge and decide on which side of it each species belongs. – Finally add a node if you can resolve a clade. (Figure: root r with leaves A, B, C, D, E; a new internal node u is added.)

26 Adding other columns. Add other columns on edges using the ordering property. (Figure: resulting tree rooted at r over species A, B, C, D, E.)
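The rooted construction can be sketched as follows. This is not necessarily the exact procedure drawn on the slides: it uses the standard formulation in which, after sorting, every row must reach a given mutation through the same preceding mutation. The example matrix, the species names and the function name are illustrative assumptions.

```python
def build_perfect_phylogeny(matrix, names):
    """Rooted case (0 = ancestral, 1 = mutated).

    Returns (parent, leaf_of): parent maps each mutation column to the column
    on the edge above it ('r' for the root); leaf_of maps each species to the
    last mutation on its root-to-leaf path. Raises ValueError if the matrix
    admits no rooted perfect phylogeny.
    """
    n_cols = len(matrix[0])

    def as_number(j):
        # Read column j as a binary number, row 1 being the most significant bit.
        return int("".join(str(row[j]) for row in matrix), 2)

    order = sorted(range(n_cols), key=as_number, reverse=True)
    parent, leaf_of = {}, {}
    for name, row in zip(names, matrix):
        prev = "r"
        for j in order:
            if row[j] == 1:
                # Every row carrying mutation j must agree on the mutation above it.
                if parent.setdefault(j, prev) != prev:
                    raise ValueError(f"column {j} conflicts with column {prev}")
                prev = j
        leaf_of[name] = prev
    return parent, leaf_of

M = [[1, 1, 0, 0, 0],
     [0, 0, 1, 0, 0],
     [1, 1, 0, 0, 1],
     [0, 0, 1, 1, 0],
     [0, 1, 0, 0, 0]]
parent, leaf_of = build_perfect_phylogeny(M, "ABCDE")
print(parent)
print(leaf_of)
```

Storing only a parent pointer per mutation mirrors the slide's remark that it is sufficient to construct a parent for every node.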

27 Unrooted case. Switch the values in each column, so that 0 is the majority element. Apply the algorithm for the rooted case.

28 Handling recombination. A tree is not sufficient, as a sequence may have 2 parents. Recombination leads to loss of correlation between columns.

29 Linkage (Dis)-equilibrium (LD). Consider sites A & B. Case 1: No recombination – Pr[(A,B) = (0,1)] = 0.25: linkage disequilibrium. Case 2: Extensive recombination – Pr[(A,B) = (0,1)] = 0.125: linkage equilibrium.
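The relabeling step for the unrooted case is small enough to show in full. This is a minimal sketch with a hypothetical function name; ties are left unflipped, which is an arbitrary choice.

```python
def flip_to_majority_zero(matrix):
    """Swap 0/1 in every column where 1 is the strict majority value."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    flip = [2 * sum(row[j] for row in matrix) > n_rows for j in range(n_cols)]
    return [[1 - v if flip[j] else v for j, v in enumerate(row)] for row in matrix]

relabeled = flip_to_majority_zero([[1, 0], [1, 1], [0, 0]])
print(relabeled)
```

The relabeled matrix can then be handed to any rooted perfect phylogeny procedure, with 0 treated as the ancestral state.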

30 Handling recombination. A tree is not sufficient, as a sequence may have 2 parents. Recombination leads to loss of correlation between columns.

31 Recombination, and populations Think of a population of N individual chromosomes. The population remains stable from generation to generation. Without recombination, each individual has exactly one parent chromosome from the previous generation. With recombinations, each individual is derived from one or two parents. We will formalize this notion later in the context of coalescent theory.

32 Linkage (Dis)-equilibrium (LD). Consider sites A & B. Case 1: No recombination. Each new individual chromosome chooses a parent from the existing ‘haplotypes’.

33 Linkage (Dis)-equilibrium (LD). Consider sites A & B. Case 2: diploidy and recombination. Each new individual chooses a parent from the existing alleles.

34 Linkage (Dis)-equilibrium (LD). Consider sites A & B. Case 1: No recombination. Each new individual chooses a parent from the existing ‘haplotypes’ – Pr[(A,B) = (0,1)] = 0.25: linkage disequilibrium. Case 2: Extensive recombination. Each new individual simply chooses an allele from either site – Pr[(A,B) = (0,1)] = 0.125: linkage equilibrium.

35 LD. In the absence of recombination, – Correlation between columns – The joint probability Pr[A=a,B=b] is different from P(a)P(b). With extensive recombination – Pr(a,b) = P(a)P(b).

36 Measures of LD. Consider two bi-allelic sites with alleles marked with 0 and 1. Define – $P_{00}$ = Pr[Allele 0 in locus 1, and 0 in locus 2] – $P_{0*}$ = Pr[Allele 0 in locus 1]. Linkage equilibrium if $P_{00} = P_{0*} P_{*0}$. $D = |P_{00} - P_{0*} P_{*0}| = |P_{01} - P_{0*} P_{*1}| = \ldots$
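The measure D can be computed directly from a sample of two-site haplotypes. This is a minimal sketch, assuming each haplotype is given as a pair of 0/1 alleles; the function name and the two toy samples are hypothetical.

```python
def ld_D(haplotypes):
    """|P00 - P0* P*0| for a list of (allele at locus 1, allele at locus 2) pairs."""
    n = len(haplotypes)
    p00 = sum(1 for a, b in haplotypes if a == 0 and b == 0) / n
    p0_any = sum(1 for a, _ in haplotypes if a == 0) / n   # P0*
    pany_0 = sum(1 for _, b in haplotypes if b == 0) / n   # P*0
    return abs(p00 - p0_any * pany_0)

linked = [(0, 0), (0, 0), (1, 1), (1, 1)]    # only two haplotypes present
shuffled = [(0, 0), (0, 1), (1, 0), (1, 1)]  # all four haplotypes equally frequent
print(ld_D(linked), ld_D(shuffled))
```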

37 LD over time. With random mating, and fixed recombination rate r between the sites, linkage disequilibrium will disappear. – Let $D^{(t)}$ = LD at time t – $P^{(t)}_{00} = (1-r)\,P^{(t-1)}_{00} + r\,P^{(t-1)}_{0*} P^{(t-1)}_{*0}$ – $D^{(t)} = P^{(t)}_{00} - P^{(t)}_{0*} P^{(t)}_{*0} = P^{(t)}_{00} - P^{(t-1)}_{0*} P^{(t-1)}_{*0}$ – $D^{(t)} = (1-r)\,D^{(t-1)} = (1-r)^t D^{(0)}$
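The recurrence can be iterated numerically and compared with the closed form. This is a sketch under the slide's assumptions (random mating, constant allele frequencies); the starting frequencies and r = 0.1 are illustrative values and the function name is hypothetical.

```python
def ld_trajectory(p00, p0_any, pany_0, r, generations):
    """Iterate P00(t) = (1-r) P00(t-1) + r P0* P*0 and record D(t) each generation."""
    d_values = [p00 - p0_any * pany_0]
    for _ in range(generations):
        p00 = (1 - r) * p00 + r * p0_any * pany_0  # allele frequencies stay fixed
        d_values.append(p00 - p0_any * pany_0)
    return d_values

trajectory = ld_trajectory(p00=0.5, p0_any=0.5, pany_0=0.5, r=0.1, generations=10)
closed_form = [0.25 * (1 - 0.1) ** t for t in range(11)]
print(max(abs(a - b) for a, b in zip(trajectory, closed_form)))
```

The printed discrepancy is at the level of floating-point rounding, consistent with $D^{(t)} = (1-r)^t D^{(0)}$.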

38 LD over distance. Assumption – Recombination rate increases linearly with distance – LD decays exponentially with distance. The assumption is reasonable, but recombination rates vary from region to region, adding to complexity. This simple fact is the basis of disease association mapping.
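The link between the two bullets can be made explicit by combining the assumption with $D^{(t)} = (1-r)^t D^{(0)}$. The sketch below assumes $r = c\,d$ for a hypothetical per-unit-distance recombination rate $c$ and distance $d$, with $c\,d \ll 1$.

```latex
D^{(t)}(d) \;=\; (1 - c\,d)^{t}\, D^{(0)} \;\approx\; e^{-c\,d\,t}\, D^{(0)}
```

For a fixed number of generations $t$, LD therefore falls off exponentially in the distance $d$.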

39 LD and disease mapping. Consider a mutation that is causal for a disease. The goal of disease gene mapping is to discover which gene (locus) carries the mutation. Consider every polymorphism, and check: – There might be too many polymorphisms – Multiple mutations (even at a single locus) may lead to the same disease. Instead, consider a dense sample of polymorphisms that span the genome.

40 LD can be used to map disease genes. LD decays with distance from the disease allele. By plotting LD, one can short-list the region containing the disease gene. (Figure: samples labeled D and N, with LD plotted along the region.)

41 LD and disease gene mapping problems. Marker density? Complex diseases. Population sub-structure.

42 Human Samples. We look at data from human samples (Gabriel et al., Science). – 3 populations were sampled at multiple regions spanning the genome: 54 regions (average size 250 Kb); SNP density 1 per 2 Kb; 90 individuals from Nigeria (Yoruban); 93 Europeans; 42 Asians; 50 African Americans.

43 Population specific recombination. D’ was used as the measure between SNP pairs. SNP pairs were classified in one of the following – Strong LD – Strong evidence for recombination – Others (13% of cases). This roughly favors out-of-Africa. A coalescent simulation can help give confidence values on this. Gabriel et al., Science 2002

44 Haplotype Blocks. A haplotype block is a region of low recombination. – Define a region as a block if fewer than 5% of the SNP pairs show strong evidence of recombination. Much of the genome is in blocks. The distribution of block sizes varies across populations.

45 Testing Out-of-Africa Generate simulations with and without migration. Check size of haplotype blocks. – Does it vary when migrations are allowed? – When the ‘new’ population has a bottleneck? If there was a bottleneck that created European and Asian populations, can we say anything about frequency of alleles that are ‘African specific’? – Should they be high frequency, or low frequency in African populations?

46 Haplotype Block: implications The genome is mostly partitioned into haplotype blocks. Within a block, there is extensive LD. – Is this good, or bad, for association mapping?

47 Coalescent reconstruction Reconstructing likely coalescents

48 Re-constructing history in the absence of recombination

49 An algorithm for constructing a perfect phylogeny. We will consider the case where 0 is the ancestral state, and 1 is the mutated state. This restriction is lifted later (unrooted case). In any tree, each node (except the root) has a single parent. – It is sufficient to construct a parent for every node. In each step, we add a column and refine some of the nodes containing multiple children. Stop when all columns have been considered.

50 Inclusion Property. For any pair of columns i,j – i < j if and only if $i_1 \supseteq j_1$. Note that if i < j then the edge containing i is an ancestor of the edge containing j.

51 Example. (Figure: species A, B, C, D, E attached to a single root r.) Initially, there is a single clade r, and each node has r as its parent.

52 Sort columns. Sort columns according to the inclusion property (note that the columns are already sorted here). This can be achieved by considering the columns as binary representations of numbers (most significant bit in row 1) and sorting in decreasing order. (Figure: matrix with rows A, B, C, D, E.)

53 Add first column. In adding column i – Check each edge and decide on which side of it each species belongs. – Finally add a node if you can resolve a clade. (Figure: root r with leaves A, B, C, D, E; a new internal node u is added.)

54 Adding other columns. Add other columns on edges using the ordering property. (Figure: resulting tree rooted at r over species A, B, C, D, E.)

55 Unrooted case. The important point is that the perfect phylogeny condition does not change when you interchange 1s and 0s in a column. Switch the values in each column, so that 0 is the majority element. Apply the algorithm for the rooted case. Homework: show that this is a correct algorithm.

56 Population Sub-structure

57 Population sub-structure can increase LD. Consider two populations that were isolated and evolving independently. They might have different allele frequencies in some regions. Pick two regions that are far apart (LD is very low, close to 0). Pop. A: $p_1 = 0.1$, $q_1 = 0.9$, $P_{11} = 0.1$, $D = 0.01$. Pop. B: $p_1 = 0.9$, $q_1 = 0.1$, $P_{11} = 0.1$, $D = 0.01$.

58 Recent ad-mixing of population. If the populations came together recently (Ex: African and European populations), artificial LD might be created. Pop. A+B: $p_1 = 0.5$, $q_1 = 0.5$, $P_{11} = 0.1$, $D = |0.1 - 0.25| = 0.15$. D = 0.15 (instead of 0.01) is a 15-fold increase. This spurious LD might lead to false associations. Other genetic events can cause LD to arise, and one needs to be careful.
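The arithmetic behind the admixture example can be checked in a few lines. This is a minimal sketch, assuming the two populations are mixed in equal proportions (the slides do not state the proportions, but equal mixing reproduces their numbers); the function name is hypothetical.

```python
def mixture_D(populations, weights):
    """D = |P11 - p1 * q1| after pooling populations given as (p1, q1, P11) triples."""
    p1 = sum(w * pop[0] for w, pop in zip(weights, populations))
    q1 = sum(w * pop[1] for w, pop in zip(weights, populations))
    p11 = sum(w * pop[2] for w, pop in zip(weights, populations))
    return abs(p11 - p1 * q1)

pop_a = (0.1, 0.9, 0.1)  # p1, q1, P11 in population A
pop_b = (0.9, 0.1, 0.1)  # p1, q1, P11 in population B
d_within = mixture_D([pop_a], [1.0])
d_mixed = mixture_D([pop_a, pop_b], [0.5, 0.5])
print(round(d_within, 4), round(d_mixed, 4))
```

Pooling averages the allele and haplotype frequencies, but the product $p_1 q_1$ of the averaged frequencies (0.25) is far from the average haplotype frequency (0.1), which is where the spurious LD comes from.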

59 Determining population sub-structure. Given a mix of people, can you sub-divide them into ethnic populations? Turn the ‘problem’ of spurious LD into a clue. – Find markers that are too far apart to show LD – If they do show LD (correlation), that shows the existence of multiple populations. – Sub-divide them into populations so that LD disappears.

60 Determining Population sub-structure. Same example as before: the two markers are too far apart to show any LD, yet they do show LD. However, if you split the individuals so that all 0..1 are in one population and all 1..0 are in another, LD disappears.

61 Iterative algorithm for population sub-structure. Define N = number of individuals (each has a single chromosome); k = number of sub-populations. $Z \in \{1..k\}^N$ is a vector giving the sub-population. – $Z_i = k'$ means individual i is assigned to population k'. $X_{i,j}$ = allelic value for individual i in position j. $P_{k,j,l}$ = frequency of allele l at position j in population k.

62 Example. Ex: consider the following assignment. $P_{1,1,0} = 0.9$, $P_{2,1,0} =$

63 Goal. X is known. P, Z are unknown. The goal is to estimate Pr(P,Z|X). Various learning techniques can be employed. – $\max_{P,Z} \Pr(X \mid P,Z)$ (maximum likelihood estimate) – $\max_{P,Z} \Pr(X \mid P,Z)\,\Pr(P,Z)$ (MAP) – Sample P,Z from Pr(P,Z|X). Here a Bayesian (MCMC) scheme is employed to sample from Pr(P,Z|X). We will only consider a simplified version.

64 Algorithm: Structure. Iteratively estimate – $(Z^{(0)},P^{(0)}), (Z^{(1)},P^{(1)}), \ldots, (Z^{(m)},P^{(m)})$. After ‘convergence’, $Z^{(m)}$ is the answer. Iteration – Guess $Z^{(0)}$ – For m = 1,2,..: sample $P^{(m)}$ from $\Pr(P \mid X, Z^{(m-1)})$; sample $Z^{(m)}$ from $\Pr(Z \mid X, P^{(m)})$. How is this sampling done?

65 Example. Choose Z at random, so each individual is assigned to one of 2 populations. See example. Now, we need to sample $P^{(1)}$ from $\Pr(P \mid X, Z^{(0)})$. Simply count: $N_{k,j,l}$ = number of people in population k who have allele l in position j; $p_{k,j,l} = N_{k,j,l} / N_{k,j,*}$.

66 Example. $N_{k,j,l}$ = number of people in population k who have allele l in position j; $p_{k,j,l} = N_{k,j,l} / N_{k,j,*}$. $N_{1,1,0} = 4$, $N_{1,1,1} = 6$, $p_{1,1,0} = 4/10$, $p_{1,2,0} = 4/10$. Thus, we can sample $P^{(m)}$.
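The counting step can be written out as follows. This is a sketch only: the ten haplotypes assigned to population 1 are invented so that the counts match the slide's numbers ($N_{1,1,0} = 4$, $N_{1,1,1} = 6$); the slide's actual data is not recoverable from the transcript. The function name is hypothetical.

```python
from collections import Counter

def count_frequencies(X, Z):
    """Return p[(k, j, l)] = N_{k,j,l} / N_{k,j,*} with 1-based populations and positions."""
    counts = Counter()
    for haplotype, pop in zip(X, Z):
        for j, allele in enumerate(haplotype, start=1):
            counts[(pop, j, int(allele))] += 1
    p = {}
    for (pop, j, allele), n in counts.items():
        total = counts[(pop, j, 0)] + counts[(pop, j, 1)]  # N_{k,j,*}
        p[(pop, j, allele)] = n / total
    return p

# Hypothetical population 1: 2 x "00", 2 x "01", 2 x "10", 4 x "11".
X = ["00", "00", "01", "01", "10", "10", "11", "11", "11", "11"]
Z = [1] * len(X)
p = count_frequencies(X, Z)
print(p[(1, 1, 0)], p[(1, 2, 0)])
```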

67 Sampling Z. Pr[$Z_1 = 1$] = Pr[“01” belongs to population 1]? We know that each position should be in linkage equilibrium and independent. Pr[“01” | Population 1] = $p_{1,1,0} \cdot p_{1,2,1}$ = (4/10)(6/10) = 0.24. Pr[“01” | Population 2] = $p_{2,1,0} \cdot p_{2,2,1}$ = (6/10)(4/10) = 0.24. Pr[$Z_1 = 1$] = 0.24/(0.24 + 0.24) = 0.5, assuming HWE and LE.

68 Sampling. Suppose, during the iteration, there is a bias. Then, in the next step of sampling Z, we will do the right thing. Pr[“01” | pop. 1] = $p_{1,1,0} \cdot p_{1,2,1}$ = 0.7 × 0.7 = 0.49. Pr[“01” | pop. 2] = $p_{2,1,0} \cdot p_{2,2,1}$ = 0.3 × 0.3 = 0.09. Pr[$Z_1 = 1$] = 0.49/(0.49 + 0.09) ≈ 0.85. Pr[$Z_6 = 1$] = 0.49/(0.49 + 0.09) ≈ 0.85. Eventually all “01” will become one population, and all “10” will become a second population.

69 Allowing for admixture. Define $q_{i,k}$ as the fraction of individual i that originated from population k. Iteration – Guess $Z^{(0)}$ – For m = 1,2,..: sample $P^{(m)}, Q^{(m)}$ from $\Pr(P,Q \mid X, Z^{(m-1)})$; sample $Z^{(m)}$ from $\Pr(Z \mid X, P^{(m)}, Q^{(m)})$.

70 Estimating Z (admixture case). Instead of estimating Pr(Z(i)=k|X,P,Q) (origin of individual i is k), we estimate Pr(Z(i,j,l)=k|X,P,Q). (Figure: the two allele copies (i,1) and (i,2) of individual i at position j.)
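The whole simplified iteration, without admixture, can be sketched as below. Several choices here are assumptions rather than the slides' method: populations are indexed from 0, `P[k][j]` stores the frequency of allele 0, and a smoothed point estimate with a hypothetical pseudo-count stands in for sampling P from its posterior. All function names are hypothetical.

```python
import random

def estimate_P(X, Z, k, pseudo=1.0):
    """P[pop][j] = smoothed frequency of allele '0' at position j in population pop."""
    P = []
    for pop in range(k):
        members = [x for x, z in zip(X, Z) if z == pop]
        P.append([(sum(x[j] == "0" for x in members) + pseudo) / (len(members) + 2 * pseudo)
                  for j in range(len(X[0]))])
    return P

def posterior_Z(x, P):
    """Pr[Z_i = pop | x, P], multiplying per-position frequencies (linkage equilibrium)."""
    likelihoods = []
    for freqs in P:
        like = 1.0
        for j, allele in enumerate(x):
            like *= freqs[j] if allele == "0" else 1.0 - freqs[j]
        likelihoods.append(like)
    total = sum(likelihoods)
    return [like / total for like in likelihoods]

def structure_sketch(X, k=2, iterations=50, seed=0):
    rng = random.Random(seed)
    Z = [rng.randrange(k) for _ in X]        # guess Z^(0)
    for _ in range(iterations):
        P = estimate_P(X, Z, k)              # stands in for sampling P^(m)
        Z = [rng.choices(range(k), weights=posterior_Z(x, P))[0] for x in X]
    return Z

# The slide's biased frequencies: allele-0 frequency 0.7 / 0.3 at the two positions.
biased_P = [[0.7, 0.3], [0.3, 0.7]]
print(round(posterior_Z("01", biased_P)[0], 2))
print(structure_sketch(["01"] * 5 + ["10"] * 5))
```

With the biased frequencies, the posterior for “01” reproduces the slide's 0.49/(0.49 + 0.09). The assignments returned by `structure_sketch` depend on the random seed, and the labels 0 and 1 are interchangeable.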

71 Results on admixture prediction

72 Results: Thrush data For each individual, q(i) is plotted as the distance to the opposite side of the triangle. The assignment is reliable, and there is evidence of admixture.

73 Population Structure. 377 locations (loci) were sampled in 1000 people from 52 populations. 6 genetic clusters were obtained, which corresponded to 5 geographic regions (Rosenberg et al. Science 2003): Africa, Eurasia, East Asia, America, Oceania.

74 Population sub-structure: research problem. Systematically explore the effect of admixture. Can admixture be predicted for a locus, or for an individual? The sampling approach may or may not be appropriate. Formulate as an optimization/learning problem: – (w/out admixture) Assign individuals to sub-populations so as to maximize linkage equilibrium and Hardy-Weinberg equilibrium in each of the sub-populations – (w/ admixture) Assign (individuals, loci) to sub-populations