
Presentation transcript:


Mutations are constantly arising in a population. Each mutation eventually stops segregating, either because it is eliminated or because it reaches fixation. The rate at which this happens depends on the selective pressure on the mutation. – Selection: the rate at which a carrier is chosen as a parent. When no selective forces act, the population is likely to be in equilibrium of various sorts.

[Figure: allele frequency spectrum. Axes: frequency vs. count, and frequency vs. scaled (normalized) count; average of 500 simulated population samples (Fu, 1995). Outside of syllabus.]

Given: a population of diploid individuals and a locus with two alleles, A and a. There are 3 genotypes: AA, Aa, aa. Q: Will the frequencies of alleles and genotypes remain constant from generation to generation? [Figure: frequencies of A and a over time (generations)]

To the Editor of Science: I am reluctant to intrude in a discussion concerning matters of which I have no expert knowledge, and I should have expected the very simple point which I wish to make to have been familiar to biologists. However, some remarks of Mr. Udny Yule, to which Mr. R. C. Punnett has called my attention, suggest that it may still be worth making... ………. A little mathematics of the multiplication-table type is enough to show ….the condition for this is $q^2 = pr$. And since $q_1^2 = p_1 r_1$, whatever the values of p, q, and r may be, the distribution will in any case continue unchanged after the second generation.

Suppose Pr(A) = p and Pr(a) = 1 - p = q. If certain assumptions are met:
– Large, diploid population
– Discrete generations
– Random mating
– No selection, …
then, in every generation, the genotype frequencies are Pr(AA) = $p^2$, Pr(Aa) = $2pq$, Pr(aa) = $q^2$.

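In the next generation, the frequency of A is $p^2 + \tfrac{1}{2}(2pq) = p(p+q) = p$, so the allele frequencies stay constant over time (generations).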

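Multiple alleles with frequencies $p_1, p_2, \ldots, p_k$ – By HW, Pr(homozygous for allele i) = $p_i^2$ and Pr(heterozygous for alleles i and j) = $2 p_i p_j$. Multiple loci?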

It is observed that 1 in 10,000 Caucasians have the disease phenylketonuria. The disease mutation(s) are recessive. What fraction of the population carries the mutation?
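One way to work through this question is sketched below, assuming HWE holds and that a single recessive allele with frequency q accounts for the disease, so that the incidence equals $q^2$. The function name `carrier_fraction` is made up for this illustration.

```python
import math

def carrier_fraction(incidence):
    """Fraction of heterozygous carriers, assuming HWE and a single
    recessive disease allele (both are assumptions of this sketch)."""
    q = math.sqrt(incidence)   # incidence = q^2 for a recessive disease
    p = 1.0 - q
    return 2.0 * p * q         # heterozygotes (Aa) carry one copy

if __name__ == "__main__":
    print(carrier_fraction(1 / 10_000))
```

With an incidence of 1 in 10,000, q is 0.01 and roughly 2% of the population are carriers, about 200 times the frequency of affected individuals.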

Males are 100 times more likely to have the ‘red’ type of color blindness than females. Why?

Individuals homozygous for S have the sickle-cell disease. In an experiment, the ratios A/A:A/S: S/S were 9365:2993:29. Is HWE violated? Is there a reason for this violation?
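The first part of this question can be checked numerically. The sketch below estimates allele frequencies from the observed genotype counts, computes the counts expected under HWE, and forms a chi-square statistic. The function name `hwe_chi_square` is an assumption of this example; the 3.84 threshold mentioned afterwards is the usual 5% critical value for one degree of freedom.

```python
def hwe_chi_square(n_aa, n_as, n_ss):
    """Chi-square statistic comparing observed genotype counts
    (A/A, A/S, S/S) with the counts expected under HWE."""
    n = n_aa + n_as + n_ss
    p = (2 * n_aa + n_as) / (2 * n)   # frequency of allele A
    q = 1.0 - p                       # frequency of allele S
    expected = [n * p * p, n * 2 * p * q, n * q * q]
    observed = [n_aa, n_as, n_ss]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, expected

if __name__ == "__main__":
    chi2, expected = hwe_chi_square(9365, 2993, 29)
    print("expected counts:", [round(e, 1) for e in expected])
    print("chi-square:", round(chi2, 1))
```

For the counts on the slide, the statistic is far above 3.84: heterozygotes are over-represented and S/S homozygotes are under-represented relative to HWE.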

SNP-chips can give us the genotype at each site based on hybridization. Plot the 3 genotypes (A/A, A/B, B/B) at each locus on 3 separate horizontal lines against genomic location. [Figure: zoomed-out picture of genotype calls along the genome]

SNP-chips can give us the allelic value at each polymorphic site based on hybridization. What is peculiar in the picture? What is your conclusion?

The so-called 'bread wheat' is hexaploid (6 copies of each chromosome). Consider a locus with 4 allelic values (a, b, c, d) with frequencies 0.5, 0.25, 0.15, 0.1, respectively.
1. Compute the number of distinct possible genotypes.
2. Compute the expected number of occurrences of the genotype $ab^3c^2$ in a sample of 10,000 individuals, assuming HW equilibrium holds.
3. Generalize part 1 to compute the number of distinct genotypes given a ploidy of n (n copies of each chromosome) and m alleles.
A group of individuals in New York City was genotyped. Would you be surprised if HWE was violated?
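The wheat exercise can be checked with a short calculation. This sketch assumes that, under HW equilibrium, the 6 allele copies of an individual are independent draws from the allele frequencies, so a genotype is a multiset of alleles with a multinomial probability. The function names are illustrative only.

```python
from math import comb, factorial, prod

def num_genotypes(ploidy, n_alleles):
    """Number of multisets of size `ploidy` drawn from `n_alleles` alleles."""
    return comb(ploidy + n_alleles - 1, n_alleles - 1)

def genotype_probability(counts, freqs):
    """Multinomial probability of a genotype with the given copy count
    of each allele (assumes independent draws, i.e. HW equilibrium)."""
    coefficient = factorial(sum(counts))
    for c in counts:
        coefficient //= factorial(c)
    return coefficient * prod(f ** c for f, c in zip(freqs, counts))

if __name__ == "__main__":
    freqs = [0.5, 0.25, 0.15, 0.1]              # alleles a, b, c, d
    print(num_genotypes(6, 4))                  # parts 1 and 3
    # genotype a b^3 c^2: one a, three b, two c, no d
    print(10_000 * genotype_probability([1, 3, 2, 0], freqs))
```

The multiset count $\binom{n+m-1}{m-1}$ gives 84 genotypes for n = 6 and m = 4, and the expected number of $ab^3c^2$ individuals in a sample of 10,000 is about 105.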

Violation of HWE is common in nature Non-HWE implies that some assumption is violated Figuring out the violated assumption leads to biological insight


The mtDNA is inherited directly from the mother. Males inherit the Y chromosome directly from their father. The genealogical relationship of these chromosomes does not involve recombination. – Each individual has a single parent in the previous generation. – The genealogy is expressed as a tree. This principle can be used to track the ancestry and migration history of a population. (CSE280, Vineet Bafna)

Recall that we often study a population in the form of a SNP matrix – Rows correspond to individuals (or individual chromosomes), columns correspond to SNPs – The matrix is binary (why?) – The underlying genealogy is hidden. If the span is large, the genealogy is not a tree any more. Why?

Input: SNP matrix M (n rows, m columns/sites/mutations).
Output: a tree with the following properties:
– Rows correspond to leaf nodes.
– We add mutations to edges.
– Each edge labeled with i splits the individuals into two subsets: individuals with a 1 in column i, and individuals with a 0 in column i.

Each mutation can be labeled by its column number. The goal is to reconstruct the phylogeny. [Figure: genographic atlas]

We will consider the case where 0 is the ancestral state and 1 is the mutated state. This restriction will be relaxed later. In any tree, each node (except the root) has a single parent. – It is sufficient to construct a parent for every node. In each step, we add a column and refine some of the nodes containing multiple children. Stop when all columns have been considered.

Define $i_0$: taxa (individuals) with a 0 at the i'th column. Define $i_1$: taxa (individuals) with a 1 at the i'th column.

For any pair of columns i, j, one of the following holds:
– $i_1 \subseteq j_1$
– $j_1 \subseteq i_1$
– $i_1 \cap j_1 = \emptyset$
For any pair of columns i, j, define i < j if and only if $i_1 \supseteq j_1$. Note that if i < j, then the edge containing i is an ancestor of the edge containing j.
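The pairwise condition can be tested directly. The sketch below is a straightforward quadratic check, not the linear-time method; it represents each column by the set of rows carrying a 1 and verifies that every pair of sets is nested or disjoint. The function names and the small example matrices are made up for illustration.

```python
from itertools import combinations

def one_set(matrix, j):
    """Set of row indices that have a 1 in column j."""
    return {i for i, row in enumerate(matrix) if row[j] == 1}

def columns_are_compatible(matrix):
    """True if every pair of columns is nested or disjoint
    (0 is assumed to be the ancestral state)."""
    sets = [one_set(matrix, j) for j in range(len(matrix[0]))]
    for a, b in combinations(sets, 2):
        if not (a <= b or b <= a or a.isdisjoint(b)):
            return False
    return True

if __name__ == "__main__":
    good = [[1, 1, 0, 0, 0],
            [0, 0, 1, 0, 0],
            [1, 1, 0, 0, 1],
            [0, 0, 1, 1, 0],
            [1, 0, 0, 0, 0]]
    bad = [[1, 1],
           [1, 0],
           [0, 1]]
    print(columns_are_compatible(good), columns_are_compatible(bad))
```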

Initially, there is a single clade r, and each node has r as its parent. [Figure: root r with children A, B, C, D, E]

Sort columns according to the inclusion property (note that the columns are already sorted here). This can be achieved by treating the columns as binary representations of numbers (most significant bit in row 1) and sorting in decreasing order. [Figure: SNP matrix with rows A, B, C, D, E]

When adding column i: – Check each individual and decide on which side of the split it belongs. – Finally, add a node if the column resolves a clade. [Figure: tree rooted at r with leaves A, B, C, D, E and a new internal node u]

Add the other columns on edges using the ordering property. [Figure: resulting tree rooted at r with leaves A, B, C, D, E]
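The rooted construction can be sketched compactly, assuming 0 is the ancestral state. This version does not refine clades column by column as in the slides. Instead it uses an equivalent formulation: after sorting, the 1-columns of each row spell out its path from the root, and a perfect phylogeny exists only if every column always sees the same predecessor. The node labels (`"r"`, `("col", j)`) and the function name are assumptions of this example.

```python
def build_perfect_phylogeny(matrix, names):
    """Return a parent map for the rooted perfect phylogeny,
    or None if the columns conflict. 0 is the ancestral state."""
    n_cols = len(matrix[0])
    # Sort columns as binary numbers (first row = most significant bit),
    # in decreasing order, so ancestors come before descendants.
    order = sorted(range(n_cols),
                   key=lambda j: tuple(row[j] for row in matrix),
                   reverse=True)
    parent = {}
    for row, name in zip(matrix, names):
        prev = "r"                      # every path starts at the root
        for j in order:
            if row[j] == 1:
                node = ("col", j)       # node below the edge carrying mutation j
                # The edge for column j must hang off the same node in every row.
                if parent.setdefault(node, prev) != prev:
                    return None
                prev = node
        parent[name] = prev             # leaf attaches below its last mutation
    return parent

if __name__ == "__main__":
    m = [[1, 1, 0, 0, 0],
         [0, 0, 1, 0, 0],
         [1, 1, 0, 0, 1],
         [0, 0, 1, 1, 0],
         [1, 0, 0, 0, 0]]
    for child, par in build_perfect_phylogeny(m, "ABCDE").items():
        print(child, "->", par)
```

Storing only parent pointers follows the observation that it is sufficient to construct a parent for every node.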

An important point is that the perfect phylogeny condition does not change when you interchange 1s and 0s in a column. Algorithm (unrooted): – Switch the values in each column so that 0 is the majority element. – Apply the algorithm for the rooted case. – Relabel columns and individuals. Show that this is a correct algorithm.
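The relabeling step of the unrooted algorithm might look like the following sketch. It flips every column in which 1 is the strict majority and records which columns were flipped so that the labels can be restored afterwards. Leaving tied columns unchanged is a choice made here, and `to_zero_major` is a hypothetical name.

```python
def to_zero_major(matrix):
    """Flip each column whose 1s outnumber its 0s.
    Returns the relabeled matrix and a per-column 'flipped' flag."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    flipped = [2 * sum(row[j] for row in matrix) > n_rows
               for j in range(n_cols)]
    relabeled = [[1 - value if flipped[j] else value
                  for j, value in enumerate(row)]
                 for row in matrix]
    return relabeled, flipped

if __name__ == "__main__":
    print(to_zero_major([[1, 1], [1, 0], [1, 0]]))
```

The relabeled matrix can then be handed to any routine for the rooted case, and the `flipped` flags tell you which edges carry a 1 → 0 change in the original labeling.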

We transform matrix M to a 0-major matrix $M_0$. If $M_0$ has a directed perfect phylogeny, then M has a perfect phylogeny. If M has a perfect phylogeny, does $M_0$ have a directed perfect phylogeny?

Theorem: If M has a perfect phylogeny, there exists a relabeling and a perfect phylogeny such that: – The root is all 0s. – For any SNP (column), #1s ≤ #0s. – All edges are mutated 0 → 1.

Consider the perfect phylogeny of M. Find the center: a node such that none of its clades has more than n/2 nodes. – Is this always possible? Root at one of the 3 edges of the center, and direct all mutations 0 → 1 away from the root. QED. If the theorem is correct, then simply relabeling all columns so that the majority element is 0 is sufficient.

What if there is missing data (an entry that can be 0 or 1)? What if there are recurrent mutations?

Introgression with Neanderthals We can predict when the introgression event happened, and what regions of the genome have Neanderthal heritage. Science News

Recall that a SNP data-set is a 'binary' matrix. – Rows are individuals (chromosomes). – Columns are alleles at a specific locus. Suppose you have 2 SNP datasets of a contiguous genomic region but no other information – one from an African population, and one from a European population. – Can you tell which is which? – How long does the genomic region have to be?

Consider sites A & B. Case 1: No recombination. Each new individual chromosome chooses a parent from the existing haplotypes. [Figure: haplotypes at sites A and B]

Consider sites A & B. Case 2: Diploidy and recombination. Each new individual chooses a parent from the existing alleles. [Figure: alleles at sites A and B]

Consider sites A & B.
Case 1: No recombination. Each new individual chooses a parent from the existing haplotypes. – Pr[A,B = (0,1)] = 0.25. Linkage disequilibrium.
Case 2: Extensive recombination. Each new individual simply chooses an allele from either site. – Pr[A,B = (0,1)] = 0.125. Linkage equilibrium.

In the absence of recombination, – Correlation between columns – The joint probability Pr[A=a,B=b] is different from P(a)P(b) With extensive recombination – Pr(a,b)=P(a)P(b)

Consider two bi-allelic sites with alleles marked 0 and 1. Define:
– $P_{00}$ = Pr[allele 0 in locus 1, and 0 in locus 2]
– $P_{0*}$ = Pr[allele 0 in locus 1]
Linkage equilibrium holds if $P_{00} = P_{0*}P_{*0}$.
The D-measure of LD:
– $D = (P_{00} - P_{0*}P_{*0}) = -(P_{01} - P_{0*}P_{*1}) = \ldots$

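D' is obtained by dividing D by its largest possible value.
– Suppose $D = P_{00} - P_{0*}P_{*0} > 0$. Then the maximum value is $D_{max} = \min\{P_{0*}P_{*1},\ P_{1*}P_{*0}\}$.
– If $D < 0$, then the extreme value is $D_{max} = \max\{-P_{0*}P_{*0},\ -P_{1*}P_{*1}\}$.
– $D' = D / D_{max}$
In the 2×2 table of the two sites, the cells (0,0) and (1,1) deviate from their expected frequencies by $+D$, and the cells (0,1) and (1,0) by $-D$.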
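As a concrete illustration of D and D', the sketch below computes both from a list of two-site haplotypes, each given as a pair of 0/1 alleles. The input format and the name `ld_d_and_dprime` are assumptions of this example. When D is 0, including the case where a site is monomorphic, D' is reported as 0.

```python
def ld_d_and_dprime(haplotypes):
    """Return (D, D') for two bi-allelic sites.
    `haplotypes` is a list of (allele at site 1, allele at site 2)."""
    n = len(haplotypes)
    p00 = sum(1 for a, b in haplotypes if a == 0 and b == 0) / n
    p0s = sum(1 for a, _ in haplotypes if a == 0) / n   # P_{0*}
    ps0 = sum(1 for _, b in haplotypes if b == 0) / n   # P_{*0}
    p1s, ps1 = 1.0 - p0s, 1.0 - ps0
    d = p00 - p0s * ps0
    if d > 0:
        d_max = min(p0s * ps1, p1s * ps0)
    elif d < 0:
        d_max = max(-p0s * ps0, -p1s * ps1)   # most negative D allowed
    else:
        return 0.0, 0.0
    return d, d / d_max

if __name__ == "__main__":
    linked = [(0, 0)] * 5 + [(1, 1)] * 5
    shuffled = [(0, 0), (0, 1), (1, 0), (1, 1)]
    print(ld_d_and_dprime(linked), ld_d_and_dprime(shuffled))
```

Because the D < 0 case divides a negative D by a negative bound, D' comes out between 0 and 1 for either sign of D.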

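The p-value of x can be computed by looking up a table for $N(\mu, \sigma)$. Also, the Z-score can be computed as $Z = (x - \mu)/\sigma$. – Z is distributed according to $N(0, 1)$. [Figure: normal curve with mean μ, standard deviation σ, and observed value x]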

Think of a chi-square distribution as the distribution of the square of a standard normal variable.

Testing for correlation between two variables. If O = observed value and E = expected value, then the following behaves like a chi-square distributed variable: $\sum (O - E)^2 / E$. The sum of independent chi-square variables is also chi-square distributed.

D' is obtained by dividing D by its largest possible value. – Ex: $D' = |P_{11} - P_{1*}P_{*1}| / D_{max}$
$\rho = D / (P_{1*}P_{0*}P_{*1}P_{*0})^{1/2}$
Let N be the number of individuals. Show that $\rho^2 N$ is the $\chi^2$ statistic between the two sites.

The statistic behaves like a $\chi^2$ distribution (sum of squares of normal variables). A p-value can be computed directly. [Figure: 2×2 table of observed counts $O_1$, $O_2$, $O_3$, $O_4$]

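Observed and expected counts in the 2×2 table of the two sites:
– Observed: $P_{00}N$, $P_{01}N$, $P_{10}N$, $P_{11}N$
– Expected: $P_{0*}P_{*0}N$, $P_{0*}P_{*1}N$, $P_{1*}P_{*0}N$, $P_{1*}P_{*1}N$
$\rho = D / (P_{1*}P_{0*}P_{*1}P_{*0})^{1/2}$. Verify that $\rho^2 N$ is the $\chi^2$ statistic between the two sites.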
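The identity can be checked numerically before proving it. This sketch computes the correlation-based quantity (written `rho` here) and the ordinary chi-square statistic from the same 2×2 table of haplotype counts. The counts in the example are made up.

```python
def rho_squared_times_n(n00, n01, n10, n11):
    """N * D^2 / (P_{0*} P_{1*} P_{*0} P_{*1}) from 2x2 haplotype counts."""
    n = n00 + n01 + n10 + n11
    p0s, ps0 = (n00 + n01) / n, (n00 + n10) / n
    p1s, ps1 = 1.0 - p0s, 1.0 - ps0
    d = n00 / n - p0s * ps0
    return n * d * d / (p0s * p1s * ps0 * ps1)

def chi_square_2x2(n00, n01, n10, n11):
    """Sum of (O - E)^2 / E over the four cells, with E from the margins."""
    n = n00 + n01 + n10 + n11
    rows = [n00 + n01, n10 + n11]
    cols = [n00 + n10, n01 + n11]
    observed = [[n00, n01], [n10, n11]]
    return sum((observed[i][j] - rows[i] * cols[j] / n) ** 2
               / (rows[i] * cols[j] / n)
               for i in range(2) for j in range(2))

if __name__ == "__main__":
    counts = (40, 10, 20, 30)          # hypothetical haplotype counts
    print(rho_squared_times_n(*counts), chi_square_2x2(*counts))
```

Both functions return the same value up to rounding, which follows from every cell deviating from its expectation by ±DN.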

The number of recombination events between two sites can be assumed to be Poisson distributed. Let r denote the recombination rate between two adjacent sites: r = # crossovers per bp per generation. The recombination rate between two sites l apart is rl.

Decay in LD
– Let $D^{(t)}$ = LD at time t between two sites
– $r' = lr$
– $P^{(t)}_{00} = (1-r')\,P^{(t-1)}_{00} + r'\,P^{(t-1)}_{0*}P^{(t-1)}_{*0}$
– $D^{(t)} = P^{(t)}_{00} - P^{(t)}_{0*}P^{(t)}_{*0} = P^{(t)}_{00} - P^{(t-1)}_{0*}P^{(t-1)}_{*0}$ (Why?)
– $D^{(t)} = (1-r')\,D^{(t-1)} = (1-r')^t D^{(0)}$
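The recurrence can be iterated numerically to confirm the closed form. The sketch below keeps the allele frequencies fixed, which is the assumption behind the "(Why?)" step, and updates only the haplotype frequency $P_{00}$. The starting frequencies and the value of r' are arbitrary illustrative choices.

```python
def ld_after(p00, p0s, ps0, r_prime, generations):
    """Iterate P00(t) = (1 - r') P00(t-1) + r' P0* P*0 and return D(t).
    Allele frequencies p0s (P_{0*}) and ps0 (P_{*0}) stay constant."""
    for _ in range(generations):
        p00 = (1.0 - r_prime) * p00 + r_prime * p0s * ps0
    return p00 - p0s * ps0

if __name__ == "__main__":
    p00, p0s, ps0 = 0.5, 0.5, 0.5        # complete LD at t = 0, so D(0) = 0.25
    r_prime, t = 0.01, 100
    d0 = p00 - p0s * ps0
    print(ld_after(p00, p0s, ps0, r_prime, t), (1.0 - r_prime) ** t * d0)
```

The two printed values agree up to floating-point error, showing the geometric decay of D at rate (1 - r') per generation.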

Assumption – Recombination rate increases linearly with distance and time – LD decays exponentially. The assumption is reasonable, but recombination rates vary from region to region, adding to complexity This simple fact is the basis of disease association mapping.

Consider a mutation that is causal for a disease. The goal of disease gene mapping is to discover which gene (locus) carries the mutation. Consider every polymorphism, and check: – There might be too many polymorphisms – Multiple mutations (even at a single locus) that lead to the same disease Instead, consider a dense sample of polymorphisms that span the genome

LD decays with distance from the disease allele. By plotting LD, one can shortlist the region containing the disease gene. [Figure: individuals labeled D and N, with LD plotted along the region]

269 individuals:
– 90 Yorubans
– 90 Europeans (CEPH)
– 44 Japanese
– 45 Chinese
~1M SNPs

It was found that recombination rates vary across the genome – How can the recombination rate be measured? In regions with low recombination, you expect to see long haplotypes that are conserved. Why? Typically, haplotype blocks do not span recombination hot-spots

Chr 2 region with high $r^2$ value (implies little/no recombination). The history/genealogy can be explained by a tree (a perfect phylogeny). Large haplotypes with high frequency are observed.

LD is maintained up to 60 kb in the Swedish population and 6 kb in the Yoruban population. Reich et al. Nature 411, (10 May 2001)

D' was used as the measure between SNP pairs. SNP pairs were classified into one of the following: – Strong LD – Strong evidence for recombination – Others (13% of cases). The plot shows the fraction of pairs with strong recombination (low LD). This roughly favors out-of-Africa. A coalescent simulation can help give confidence values on this. Gabriel et al., Science 2002

Consider SNPs at genetic distance x (#crossovers per generation) How will the introgressed SNPs behave differently from non-introgressed SNPs? Can we use them to get at time of introgression?

The number of recombination events between two sites can be assumed to be Poisson distributed. Let x denote the recombination rate between two adjacent sites: x = # crossovers in the region per meiosis.

S(x) all SNP pairs at genetic distance x. Compute ‘Average LD’ value If the genetic distance is correct, this can be used to give an estimate of the age of the SNP.

Decay in LD
– Let $D^{(t)}$ = LD at time t between two sites
– $P^{(t)}_{00} = (1-x)\,P^{(t-1)}_{00} + x\,P^{(t-1)}_{0*}P^{(t-1)}_{*0}$
– $D^{(t)} = P^{(t)}_{00} - P^{(t)}_{0*}P^{(t)}_{*0} = P^{(t)}_{00} - P^{(t-1)}_{0*}P^{(t-1)}_{*0}$ (Why?)
– $D^{(t)} = (1-x)\,D^{(t-1)} = (1-x)^t D^{(0)}$
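Read in reverse, the decay formula suggests a way to estimate t from average LD values measured at several genetic distances. The sketch below is a simplified illustration and not the procedure of any particular paper. It assumes that the average LD at distance x follows $D_0 (1-x)^t$ with a common $D_0$, takes logarithms, and fits the slope by least squares. All names and the synthetic data are hypothetical.

```python
import math

def estimate_generations(distances, average_ld):
    """Least-squares slope of log(LD) against log(1 - x).
    Under D(x) = D0 * (1 - x)^t the slope equals t.
    Requires 0 < x < 1 and strictly positive LD values."""
    xs = [math.log(1.0 - x) for x in distances]
    ys = [math.log(d) for d in average_ld]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator

if __name__ == "__main__":
    true_t, d0 = 2000, 0.2
    distances = [0.0001 * k for k in range(1, 11)]    # crossovers per meiosis
    synthetic = [d0 * (1.0 - x) ** true_t for x in distances]
    print(estimate_generations(distances, synthetic))
```

On noise-free synthetic data the estimate recovers the t used to generate it. Real LD averages are noisy and may include LD that does not come from a single event, so a practical analysis needs more care than this fit.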

Recent origin (t is smaller) Ancient origin (t is larger)

Figure 1. Linkage disequilibrium patterns expected due to recent gene flow and ancient structure. Sankararaman S, Patterson N, Li H, Pääbo S, Reich D (2012) The Date of Interbreeding between Neandertals and Modern Humans. PLoS Genet 8(10): e doi: /journal.pgen

Figure 2. Classes of demographic models relating Africans (Y), Europeans (E), and Neandertals (N). Sankararaman S, Patterson N, Li H, Pääbo S, Reich D (2012) The Date of Interbreeding between Neandertals and Modern Humans. PLoS Genet 8(10): e doi: /journal.pgen

Table 1. Estimates of the time of gene flow for different demographic models and mutation rates as well as different ascertainments. Sankararaman S, Patterson N, Li H, Pääbo S, Reich D (2012) The Date of Interbreeding between Neandertals and Modern Humans. PLoS Genet 8(10): e doi: /journal.pgen