Population assignment likelihoods in a phylogenetic and demographic model. Jody Hey Rutgers University.

Presentation transcript:


DNA Barcoding is great!
But it is useful to keep in mind that species taxa are provisional: they are hypotheses to be revised with more data
Taxa are tools, not truth
Mitochondrial-based DNA barcodes
– Can be misleading due to chance factors (different genes have different histories)
– Can be misleading due to deterministic factors (mitochondria are a large target for natural selection)

A general problem…
You have some genetic data
– For example, a gene sequenced multiple times
– Or a microsatellite locus genotyped in a number of individuals
Suppose you are willing to assume that positive or balancing selection has not played a big role in the history of the data
What could you figure out about the history of the organisms from which the genes came?

A General Parameterization for questions on population demography, population divergence, speciation, population identification, etc.
X: genetic data (e.g. aligned sequences, microsatellites)
– may (or may not) come with population labels
– may (or may not) be given as diploid genotypes
– may include multiple loci for each sampled organism
P: population phylogeny
T: splitting times, i.e. the times of branch points in the phylogeny P
Θ: demography, i.e. population size and migration rate parameters
I: population labels, i.e. the assignment of genes to populations (which genes came from which populations or species)
G: genealogy, i.e. the gene tree for the data
G is a necessary 'nuisance' parameter: it provides a mathematical connection between X and (P, T, Θ, I)
It is possible to calculate the probability of G as a function of P, T, Θ and I, p(G|P,T,Θ,I), using coalescent models
It is possible to calculate the probability of a data set given G, p(X|G), using mutation models
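As a hedged illustration of the statement that the probability of G can be calculated using coalescent models, the sketch below covers only the simplest special case: a single population with no splitting and no migration, so the only demographic parameter is one scaled population size. The names `log_prob_genealogy`, `intervals` and `theta` are hypothetical and do not come from the presentation; time is assumed to be measured in mutational units, so that each pair of lineages coalesces at rate 2/theta.

```python
import math


def log_prob_genealogy(intervals, theta):
    """Log density of a genealogy under a single-population coalescent.

    intervals: waiting times between coalescent events, ordered from the
        tips toward the root. With n sampled gene copies there are n - 1
        intervals; the first is the time spent with n lineages and the
        last is the time spent with 2 lineages.
    theta: scaled population size (assumed parameterization: each pair of
        lineages coalesces at rate 2 / theta).
    """
    n = len(intervals) + 1
    logp = 0.0
    for i, t in enumerate(intervals):
        k = n - i  # number of lineages present during this interval
        # One specific pair coalesces (rate 2/theta) after no coalescence
        # among all k(k-1)/2 pairs for a time t.
        logp += math.log(2.0 / theta) - k * (k - 1) * t / theta
    return logp


# Example: three gene copies, 0.1 time units with 3 lineages, then 0.4 with 2.
example = log_prob_genealogy([0.1, 0.4], theta=1.0)
```

The full model on this slide adds population splitting (P, T), migration and population labels (I), which change the rates in each interval; none of that is attempted here.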

Connecting Data to the General Model – Parts 1-3
For unlabeled data: without information on the number of populations, or on which populations were sampled
1. Unlabeled data, for example:
Sequence1 ACgTACgACgCACgAAT
Sequence2 ACgTACgACgCACgAAT
Sequence3 ACCTTCgACgTACgAGT
Sequence4 ACgTTCgACgTACgAAT
Sequence5 ACCTTCgACgTACgAAT
Sequence6 ACgTTCgACgTATgAAT
2. Specify a random G with topology and branch lengths
3. With a mutation model, and a value of G, we can calculate the probability of the data given G: p(X|G)

Connecting Data to the General Model – Parts 4 & 5
4. Specify a random phylogeny P with multiple populations and with splitting times T
5. With a phylogeny that depicts populations in time, we can also pick random values for population sizes and migration rates: Θ = {N1, N2, ..., m1>2, m2>1, …}
[Figure: example phylogeny of Pop 1, Pop 2 and Pop 3 with ancestral populations Pop (2,3) and Pop (2,3),1; splitting times T1 and T2; population sizes N1, N2, N3, N(2,3) and N(2,3),1]

Connecting Data to the General Model – Parts 6-8
6. Overlay the genealogy on the phylogeny
7. Add implied migration events and other random migration events to the phylogeny
8. Identify I, the data labels representing the populations containing the data
[Figure: genealogy overlaid on the phylogeny of Pop 1, Pop 2 and Pop 3]

Calculating the likelihood of P, T, Θ, and I, given the data
If we can solve this then we can obtain maximum likelihood estimates of P, T, I and Θ
We know how to calculate p(X|G) and p(G|P,T,I,Θ)
– The math is not the hard part
The greatest challenge is finding efficient ways to sample the space of genealogies and the space of P, T, Θ, and I
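The two quantities named on this slide combine in the standard way for a nuisance parameter: G is integrated out. Written with the slide's symbols (the integral notation is an addition here and does not appear in the slides):

```latex
L(P, T, \Theta, I \mid X)
  = p(X \mid P, T, \Theta, I)
  = \int_{G} p(X \mid G)\, p(G \mid P, T, \Theta, I)\, dG
```

The integral runs over all genealogies (topologies and branch lengths), so in practice it has to be approximated by sampling, which is where the need for efficient sampling of genealogies comes from.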

Genetic Data and different types of data labels
Often population labels are known (come with data)
Aligned DNA Sequences    Population Labels
ACgTACgACgCACgAAT        A
ACgTACgACgCACgAAT        A
ACCTTCgACgTACgAGT        B
ACgTTCgACgTACgAAT        B
ACCTTCgACgTACgAAT        C
ACgTTCgACgTATgAAT        C
Population labels are already known and do not need to be estimated. Parameter I (population labels) is not included in the model.

Case 1: Data has no labeling at all
Aligned DNA Sequences    Population Labels
ACgTACgACgCACgAAT        ?
ACgTACgACgCACgAAT        ?
ACCTTCgACgTACgAGT        ?
ACgTTCgACgTACgAAT        ?
ACCTTCgACgTACgAAT        ?
ACgTTCgACgTATgAAT        ?

Case 2: No population labels, but data comes in diploid genotype pairs
Aligned DNA Sequences    Population Labels    Genotype Pairs
ACgTACgACgCACgAAT        ?                    Individual #1
ACgTACgACgCACgAAT        ?                    Individual #1
ACCTTCgACgTACgAGT        ?                    Individual #2
ACgTTCgACgTACgAAT        ?                    Individual #2
ACCTTCgACgTACgAAT        ?                    Individual #3
ACgTTCgACgTATgAAT        ?                    Individual #3
Gene copies are identified in genotype pairs only. Parameter I (population labels) is unknown (?) and needs to be estimated.

Two kinds of data sets without population labels
1. Alleles or gene copies provided without any additional information on populations
– e.g. locus may be haploid
– or, for whatever reason, data not collected in a way that yields diploid genotypes
2. Alleles or sequences provided in diploid (genotype) pairs
– This is a common situation for population assignment

Case 1: Alleles or gene copies come without any additional information on populations
The only available information on population labels (parameter I) and all other parameters (P, T, Θ) is in the actual variation in the data
This is a lot to ask of a single-locus data set.
With multiple loci, it can be possible to estimate P, T, Θ, and I
Can include information from a database on the same locus (loci), i.e. DNA barcoding

Case 2: Data comes in diploid (genotype) pairs
Such data contains two types of information for population identification:
– Patterns of variation (as in case 1)
– Knowledge that both gene copies from a single individual must come from the same population (assume no hybrids)
This problem (identifying populations based on diploid genotypes) is traditionally called population assignment

Population Assignment based on diploid genotype data
Many methods exist for population assignment, using allelic data, based on an assumption of Hardy-Weinberg equilibrium within populations
These methods do not otherwise incorporate phylogenetics or population genetics (no P, T, or Θ)
Have to overcome difficulty of not knowing the underlying allele frequencies
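A minimal sketch of the Hardy-Weinberg style of assignment described on this slide, under the simplifying assumption that the allele frequencies of each candidate population are already known (the slide points out that in practice they are not). The function names and the data layout are hypothetical, and this is not the method developed in the presentation.

```python
import math


def genotype_log_likelihood(genotype, allele_freqs):
    """Log probability of a multilocus diploid genotype in one population.

    genotype: list of (allele_1, allele_2) tuples, one per locus.
    allele_freqs: list of dicts, one per locus, mapping allele -> frequency.
    Assumes Hardy-Weinberg proportions within the population and
    independence among loci.
    """
    logl = 0.0
    for (a1, a2), freqs in zip(genotype, allele_freqs):
        p = freqs.get(a1, 0.0)
        q = freqs.get(a2, 0.0)
        # Homozygote: p^2; heterozygote: 2pq.
        prob = p * p if a1 == a2 else 2.0 * p * q
        if prob == 0.0:
            return float("-inf")  # genotype impossible in this population
        logl += math.log(prob)
    return logl


def assign(genotype, populations):
    """Return the name of the population with the highest likelihood."""
    return max(
        populations,
        key=lambda name: genotype_log_likelihood(genotype, populations[name]),
    )


# Two made-up populations, one locus with alleles "x" and "y".
populations = {
    "A": [{"x": 0.9, "y": 0.1}],
    "B": [{"x": 0.2, "y": 0.8}],
}
best = assign([("x", "x")], populations)
```

Note that P, T and Θ play no role in this calculation, which is the limitation the slide describes.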

Considering the probability of a particular genotype configuration, D
6 sequences, 3 genotype pairs
1 ACgTACgACgCACgAAT
2 ACgTACgACgCACgAAT
3 ACCTTCgACgTACgAGT
4 ACgTTCgACgTACgAAT
5 ACCTTCgACgTACgAAT
6 ACgTTCgACgTATgAAT
The actual configuration D that comes with the data is one of many possible configurations.
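To make "one of many possible configurations" concrete, the sketch below enumerates every way of grouping gene copies into unordered genotype pairs. It assumes that neither the order of the pairs nor the order within a pair matters; the function name `pairings` is hypothetical.

```python
def pairings(items):
    """Yield every way of splitting an even number of items into pairs."""
    items = list(items)
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    # Pair the first item with each possible partner, then pair up the rest.
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for tail in pairings(remaining):
            yield [(first, partner)] + tail


# The six sequences on this slide, identified by their numbers.
configurations = list(pairings([1, 2, 3, 4, 5, 6]))
n_configurations = len(configurations)
```

For six gene copies this gives 5 × 3 × 1 = 15 configurations, of which the observed D is one; the count grows quickly with sample size.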

Calculating the probability of a particular genotype configuration, D
Assume that genes come together and form zygotes at random with respect to their time of common ancestry
– This is a genealogical version of the assumption of random mating that is usually made with respect to segregating alleles (e.g. in Hardy-Weinberg)
Assume that both gene copies within an individual are in the same population

Given a genealogy, G, some genotype configurations are more probable than others under an assumption of random union of gametes