Lecture 14: Population structure and Population Assignment October 12, 2012.

Lab 7 Update
- Corrected instructions for lab 7 will be posted today
- Problem 1: consider relative levels of F-statistics as well as significance from bootstrapping
- Up to 3 points extra credit if problem 2 is done correctly
- See lab open hours schedule on lab web page
- Caveat: exams and class usage of lab
- Other computers are available: see Hari or me

Population structure in a worldwide sample of human populations
[Figure: populations grouped into regions: America, Africa, Eurasia, East Asia, Oceania]
Population = subpopulation. Group = region.

Lab 7 Revised Problem 1
Problem 1. File human_struc.xls contains data for 10 microsatellite loci used to genotype 41 human populations from a worldwide sample.
a.) Convert the file into Arlequin format and perform AMOVA based on this grouping of populations within regions using distance measures. How do you interpret these results? Report the values of the phi-statistics and their statistical significance for each AMOVA you run.
b.) Do you think that any of these regions can justifiably be divided into subregions? Pick a region, form a hypothesis for what would be a reasonable grouping of populations into subregions, then run AMOVA only for the region you selected using distance measures. Was your hypothesis supported by the data?
c.) GRADUATE STUDENTS: Which of the 5 initially defined regions has the highest diversity in terms of effective number of alleles? What is your biological explanation for this?

Lab 7 Original Problem 2 (worth 8 points if you answer this)
Use Structure to further test the hypotheses you developed in Problem 1.
a.) Calculate the posterior probabilities to test whether:
   i. All populations form a single genetically homogeneous group.
   ii. There are two genetically distinct groups within your selected region.
   iii. There are three genetically distinct groups within your selected region.
b.) Use the ΔK method to determine the most likely number of groups. How does this compare to the method based on posterior probabilities?
c.) How do the groupings of subpopulations compare to your expectations from Problem 1?
d.) Is there evidence of admixture among the groups? If so, include a table or figure showing the proportion of each subpopulation assigned to each group.
e.) GRADUATE STUDENTS: Provide a brief, literature-based explanation for the groupings you observe.

Last Time
- Sample calculation of F_ST
- Defining populations on genetic criteria: introduction to Structure

Today
- Interpretation of F-statistics
- More on the Structure program
- Principal Components Analysis
- Population assignment

F_ST: What does it tell us?
- Degree of differentiation of subpopulations
- Rules of thumb:
  - 0.05 to 0.15 is weak to moderate differentiation
  - 0.15 to 0.25 is strong differentiation
  - >0.25 is very strong differentiation
- Related to the historical level of gene exchange between populations
- May not represent current conditions
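A minimal Python sketch of how a computed F_ST value maps onto these rules of thumb. It assumes the textbook single-locus definition F_ST = (H_T - H_S) / H_T with equally weighted subpopulations; the function names and the example allele frequencies are illustrative and do not come from the lecture.

```python
def expected_heterozygosity(freqs):
    """Expected heterozygosity at one locus: 1 - sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def fst(subpop_freqs):
    """F_ST = (H_T - H_S) / H_T for one locus.

    subpop_freqs: one list of allele frequencies per subpopulation.
    Assumes equal subpopulation sizes and a polymorphic locus (H_T > 0).
    """
    n_sub = len(subpop_freqs)
    n_alleles = len(subpop_freqs[0])
    # H_S: mean within-subpopulation expected heterozygosity
    h_s = sum(expected_heterozygosity(f) for f in subpop_freqs) / n_sub
    # H_T: expected heterozygosity of the pooled (mean) allele frequencies
    mean_freqs = [sum(f[a] for f in subpop_freqs) / n_sub for a in range(n_alleles)]
    h_t = expected_heterozygosity(mean_freqs)
    return (h_t - h_s) / h_t


def describe_fst(value):
    """Label a value using the slide's ranges (shared boundaries go to the lower range)."""
    if value > 0.25:
        return "very strong"
    if value >= 0.15:
        return "strong"
    if value >= 0.05:
        return "weak to moderate"
    return "below the ranges listed on the slide"


# Two hypothetical subpopulations with allele frequencies 0.8/0.2 and 0.2/0.8
example = fst([[0.8, 0.2], [0.2, 0.8]])
print(round(example, 2), describe_fst(example))
```

In this made-up example H_S = 0.32 and H_T = 0.50, so F_ST is about 0.36, which falls in the "very strong" range.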

F_ST is related to life history (Loveless and Hamrick, 1984)

Seed Dispersal
  Gravity
  Explosive/capsule   0.262
  Winged/Plumose
Successional Stage
  Early               0.411
  Middle              0.184
  Late
Life Cycle
  Annual              0.430
  Short-lived         0.262
  Long-lived          0.077

Structure Program
- One of the most widely used programs in population genetics (original paper cited >8,000 times since 2000)
- Very flexible model can determine:
  - The most likely number of uniform groups (populations, K)
  - The genomic composition of each individual (admixture coefficients)
  - Possible population of origin

A simple model of population structure
- Individuals in our sample represent a mixture of K (unknown) ancestral populations.
- Each population is characterized by (unknown) allele frequencies at each locus.
- Within populations, markers are in Hardy-Weinberg and linkage equilibrium.
- Roughly speaking, the model sorts individuals into K clusters so as to minimize departures from Hardy-Weinberg equilibrium (HWE) and linkage equilibrium.
Slide adapted from Jonathan Pritchard, 2007 presentation to Conservation Genetics meeting

More on the model...
- Let A_1, A_2, …, A_K represent the (unknown) allele frequencies in each subpopulation
- Let Z_1, Z_2, …, Z_m represent the (unknown) subpopulation of origin of the sampled individuals
- Assuming Hardy-Weinberg and linkage equilibrium within subpopulations, the likelihood of an individual's genotype in subpopulation k is given by the product of the relevant allele frequencies:

  $\Pr(G_i \mid Z_i = k, A_k) = \prod_{l \in \text{loci}} P_l$

  where P_l is the probability of observing the individual's genotype at locus l in subpopulation k
Slide adapted from Jonathan Pritchard, 2007 presentation to Conservation Genetics meeting

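Probability of observing a genotype in a subpopulation
- The probability of observing a genotype at locus l by chance in a population is a function of the allele frequencies:

  $P = \prod_{l=1}^{m} P_l$ for m loci
  Homozygote A_iA_i: $P_l = p_i^2$
  Heterozygote A_iA_j: $P_l = 2 p_i p_j$

  where p_i and p_j are the frequencies of alleles A_i and A_j at locus l in the population
- Assumes unlinked (independent) loci and Hardy-Weinberg equilibrium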
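As a worked example of this product rule, consider a hypothetical individual typed at two unlinked loci, writing p_i and p_j for allele frequencies in the population. The numbers are invented for illustration: at locus 1 the individual is homozygous for an allele with frequency 0.6, and at locus 2 it is heterozygous for two alleles that each have frequency 0.5.

```latex
\begin{align*}
P_1 &= p_i^2 = 0.6^2 = 0.36 \\
P_2 &= 2 p_i p_j = 2(0.5)(0.5) = 0.50 \\
P &= P_1 \times P_2 = 0.36 \times 0.50 = 0.18
\end{align*}
```

Under Hardy-Weinberg equilibrium and independence of loci, the two-locus genotype would therefore be expected in about 18% of individuals from that population.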

- If we knew the population allele frequencies in advance, then it would be easy to assign individuals.
- If we knew the individual assignments, it would be easy to estimate frequencies.
- In practice, we don't know either of these, but the following MCMC algorithm converges to sensible joint estimates of both.
Slide adapted from Jonathan Pritchard, 2007 presentation to Conservation Genetics meeting

MCMC algorithm (for fixed K)
- Start with a random assignment of individuals to populations
- Step 1: Gene frequencies in each population are estimated based on the individuals that are assigned to it.
- Step 2: Individuals are assigned to populations based on gene frequencies in each population.
- And this is repeated...
- …Estimation of K is performed separately.
Slide adapted from Jonathan Pritchard, 2007 presentation to Conservation Genetics meeting
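The two alternating steps can be sketched as a toy Gibbs sampler in Python. This is not the Structure implementation: it assumes biallelic loci coded as allele counts (0, 1, or 2), no admixture, and a uniform Beta(1, 1) prior on allele frequencies, and the function name and data are hypothetical.

```python
import math
import random


def structure_like_gibbs(genotypes, k, n_iter=200, seed=0):
    """Toy sampler for fixed K: alternate frequency and assignment updates.

    genotypes: list of individuals, each a list of allele counts (0, 1, 2) per locus.
    Returns the final assignments and the last sampled allele frequencies.
    """
    rng = random.Random(seed)
    n_loci = len(genotypes[0])
    # Start with a random assignment of individuals to populations
    z = [rng.randrange(k) for _ in genotypes]
    freqs = []
    for _ in range(n_iter):
        # Step 1: sample allele frequencies from the individuals assigned to each population
        freqs = []
        for pop in range(k):
            members = [g for g, zi in zip(genotypes, z) if zi == pop]
            pop_freqs = []
            for locus in range(n_loci):
                ones = sum(g[locus] for g in members)
                zeros = 2 * len(members) - ones
                p = rng.betavariate(1 + ones, 1 + zeros)
                # Clamp away from 0 and 1 so the logarithms below stay finite
                pop_freqs.append(min(max(p, 1e-9), 1 - 1e-9))
            freqs.append(pop_freqs)
        # Step 2: reassign each individual in proportion to its genotype likelihood
        for i, g in enumerate(genotypes):
            log_w = []
            for pop in range(k):
                # HWE log-likelihood; the heterozygote factor of 2 is the same
                # for every population, so it cancels and is omitted
                lw = sum(
                    g[locus] * math.log(freqs[pop][locus])
                    + (2 - g[locus]) * math.log(1 - freqs[pop][locus])
                    for locus in range(n_loci)
                )
                log_w.append(lw)
            top = max(log_w)
            weights = [math.exp(lw - top) for lw in log_w]
            z[i] = rng.choices(range(k), weights=weights)[0]
    return z, freqs


# Hypothetical data: 10 individuals fixed for one allele, 10 fixed for the other, 8 loci
data = [[2] * 8 for _ in range(10)] + [[0] * 8 for _ in range(10)]
assignments, _ = structure_like_gibbs(data, k=2)
print(assignments)
```

With two groups this sharply differentiated the sampler should separate them within a few sweeps; which group receives label 0 or 1 is arbitrary.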

Admixed individuals are mosaics of ancestry from the original populations
[Figure: ancestral populations]
Slide adapted from Jonathan Pritchard, 2007 presentation to Conservation Genetics meeting

The two basic ancestry models used by Structure
- No Admixture: each individual is derived completely from a single subpopulation
- Admixture: individuals may have mixed ancestry: some fraction q_k of the genome of individual i is derived from subpopulation k
The admixture model allows for hybrids; it is also more flexible and often provides a better fit for complicated structure. This is what we used in lab.
Slide adapted from Jonathan Pritchard, 2007 presentation to Conservation Genetics meeting
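A short Python sketch of how the two models differ when computing a single-locus genotype probability. It assumes each allele copy is drawn independently from subpopulation k with probability q_k, which is one common reading of the admixture model; the function name and frequencies are hypothetical. Setting one q_k to 1 recovers the no-admixture case.

```python
def admixed_locus_probability(genotype, q, freqs_by_pop):
    """Genotype probability at one locus for an individual with ancestry fractions q.

    genotype: pair of allele labels, e.g. ("A", "a")
    q: ancestry fractions, one per subpopulation, summing to 1
    freqs_by_pop: one dict of allele frequencies per subpopulation
    """
    def allele_prob(allele):
        # Allele frequency in the individual's ancestry-weighted gene pool
        return sum(qk * freqs[allele] for qk, freqs in zip(q, freqs_by_pop))

    a, b = genotype
    if a == b:
        return allele_prob(a) ** 2
    return 2.0 * allele_prob(a) * allele_prob(b)


# Two hypothetical subpopulations fixed for alternative alleles
pops = [{"A": 1.0, "a": 0.0}, {"A": 0.0, "a": 1.0}]
print(admixed_locus_probability(("A", "a"), [0.5, 0.5], pops))  # half ancestry from each
print(admixed_locus_probability(("A", "a"), [1.0, 0.0], pops))  # no admixture
```

In this example a heterozygote has probability 0.5 for an individual with equal ancestry from both subpopulations and probability 0 for an individual derived entirely from the first, which is why hybrids fit poorly under the no-admixture model.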

Notes on Estimating the Number of Subpopulations (K)
- Likelihood-based method is the simplest, but likelihood often increases continuously with K
- More variability at values of K beyond the "natural" value
- Evanno et al. (2005) method measures change in likelihood and discounts for variation
- Use biological reasoning in arriving at a final value
- Priors based on population locations, other information
- Often need to do hierarchical analyses: break into subregions and run Structure separately for each
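A hedged Python sketch of the ΔK idea: the second-order change in log likelihood at each K, divided by the spread across replicate runs at that K. It uses the means of the replicate log likelihoods to form the second difference, which is one common way to compute the statistic; the function name and the log-likelihood values are invented for illustration.

```python
from statistics import mean, stdev


def delta_k(lnp_by_k):
    """Delta K for every K that has both neighbours K-1 and K+1.

    lnp_by_k: dict mapping K to a list of ln P(D|K) values from replicate runs.
    Assumes at least two replicates per K and non-zero variation among them.
    """
    means = {k: mean(values) for k, values in lnp_by_k.items()}
    result = {}
    for k in sorted(lnp_by_k):
        if k - 1 in means and k + 1 in means:
            # Absolute second-order rate of change of the mean log likelihood
            second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
            # Discount by the variation among replicate runs at this K
            result[k] = second_diff / stdev(lnp_by_k[k])
    return result


# Hypothetical replicate log likelihoods for K = 1..4
runs = {
    1: [-1000, -1002, -998],
    2: [-900, -901, -899],
    3: [-890, -895, -885],
    4: [-885, -900, -870],
}
scores = delta_k(runs)
print(scores, "best K:", max(scores, key=scores.get))
```

Here the mean likelihood keeps rising from K = 2 to K = 4, but the gain flattens and the replicate variance grows, so ΔK peaks at K = 2 (90 versus 1 at K = 3). The statistic is undefined for the smallest and largest K tested.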

Inferred human population structure
Each individual is a thin vertical line that is partitioned into K colored segments according to its membership coefficients in K clusters.
[Figure labels: Africans, Europeans, MidEast, Cent/S Asia, Asia, Oceania, America]
Rosenberg et al Science 298:

Structure is Hierarchical: Groups reveal more substructure when examined separately
Rosenberg et al Science 298:

Alternative clustering method: Principal Components Analysis
- Structure is very computationally intensive
- Often no clear best-supported K-value
- Alternative is to use traditional multivariate statistics to find uniform groups
- Principal Components Analysis is the most commonly used algorithm
- EIGENSOFT (PCA; Patterson et al., 2006; PLoS Genetics 2:e190)
Eckert, Population Structure, 5-Aug

Principal Components Analysis
- Efficient way to summarize multivariate data like genotypes
- Each axis passes through the maximum variation in the data and explains a component of the variation
ech4710/pca/s1.htm
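A minimal pure-Python sketch of what "an axis through the maximum variation" means for genotype data. It centers each marker and finds the first principal axis by power iteration; packages such as EIGENSOFT may apply additional normalization that is omitted here, and the function name and genotype matrix are hypothetical.

```python
import math


def first_pc_scores(genotypes, n_iter=200):
    """Scores of each individual on the first principal component.

    genotypes: rows are individuals, columns are allele counts (0, 1, 2) per marker.
    Assumes the first marker is variable, so the start vector is not orthogonal to PC1.
    Returns (scores, axis, proportion of total variation explained).
    """
    n_ind = len(genotypes)
    n_markers = len(genotypes[0])
    # Center each marker on its mean allele count
    means = [sum(row[j] for row in genotypes) / n_ind for j in range(n_markers)]
    centered = [[row[j] - means[j] for j in range(n_markers)] for row in genotypes]

    def project(axis):
        return [sum(r[j] * axis[j] for j in range(n_markers)) for r in centered]

    # Power iteration on X^T X converges to the direction of maximum variation
    axis = [1.0] + [0.0] * (n_markers - 1)
    for _ in range(n_iter):
        scores = project(axis)
        new_axis = [
            sum(centered[i][j] * scores[i] for i in range(n_ind))
            for j in range(n_markers)
        ]
        norm = math.sqrt(sum(a * a for a in new_axis))
        axis = [a / norm for a in new_axis]

    scores = project(axis)
    total = sum(v * v for r in centered for v in r)
    explained = sum(s * s for s in scores) / total
    return scores, axis, explained


# Hypothetical sample: three individuals from each of two differentiated groups
sample = [
    [2, 2, 0, 0], [2, 1, 0, 0], [2, 2, 0, 1],
    [0, 0, 2, 2], [0, 0, 2, 1], [0, 1, 2, 2],
]
scores, axis, explained = first_pc_scores(sample)
print([round(s, 2) for s in scores], round(explained, 2))
```

In this example the first three individuals receive positive scores and the last three negative scores, so the first axis alone separates the two groups and accounts for most of the variation.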

How do we identify population of origin?

Human Population Assignment with SNPs
- Assayed 500,000 SNP genotypes for 3,192 Europeans
- Used Principal Components Analysis to ordinate samples in space
- High correspondence between sample ordination and geographic origin of samples
- Individuals assigned to populations of origin with high accuracy
Novembre et al Nature 456:98

Likelihood Approaches
- Allow evaluation of alternative hypotheses by comparing their relative likelihoods given the evidence
- In a population assignment or forensic context, definition of the competing hypotheses is the most essential component

Population Assignment: Likelihood
- Assume you find skin cells and blood under the fingernails of a murder victim
- Victim had major debts with the Sicilian mafia as well as the Chinese mafia
- Can population assignment help to focus the investigation?
- What are H_1 and H_2?

Population Assignment: Likelihood
- "Assignment tests" are based on allele frequencies in source populations and the genetic composition of individuals
- Likelihood-based approaches
- Calculate the likelihood that an individual genotype originated in a particular population
- Assume Hardy-Weinberg and linkage equilibria
- Genotype frequencies corrected for presence of sampled individual
- Usually reported as log10 likelihood for origin in a given population relative to another population
- Implemented in the 'GENECLASS' program ( class/geneclass.html)

  $L_l = \prod_{k=1}^{m} P_{lk}$ for m loci
  $P_{lk} = p_{ilk}^2$ for homozygote A_iA_i in population l at locus k
  $P_{lk} = 2 p_{ilk} p_{jlk}$ for heterozygote A_iA_j in population l at locus k

  where p_{ilk} is the frequency of allele A_i in population l at locus k
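The likelihood comparison can be sketched in Python as follows. This is not GENECLASS: it omits the correction for the presence of the sampled individual, it does not handle alleles with zero frequency, and the population labels, allele names, and frequencies are all hypothetical.

```python
import math


def log10_likelihood(genotypes, pop_freqs):
    """log10 probability of a multilocus genotype in one population under HWE.

    genotypes: list of (allele, allele) pairs, one per locus
    pop_freqs: list of dicts of allele frequencies, one per locus
    """
    total = 0.0
    for (a, b), freqs in zip(genotypes, pop_freqs):
        p = freqs[a] ** 2 if a == b else 2.0 * freqs[a] * freqs[b]
        total += math.log10(p)  # fails if an allele has frequency 0 in this population
    return total


def assign(genotypes, freqs_by_pop):
    """Return the best-supported population and the log10 likelihood for each."""
    scores = {pop: log10_likelihood(genotypes, f) for pop, f in freqs_by_pop.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Two hypothetical reference populations typed at two loci
reference = {
    "X": [{"A": 0.9, "a": 0.1}, {"B": 0.7, "b": 0.3}],
    "Y": [{"A": 0.2, "a": 0.8}, {"B": 0.3, "b": 0.7}],
}
unknown = [("A", "A"), ("B", "b")]
best, scores = assign(unknown, reference)
print(best, round(scores["X"] - scores["Y"], 2))
```

For this genotype the likelihoods are 0.81 × 0.42 in X and 0.04 × 0.42 in Y, so origin in X is about 20 times more likely, a log10 likelihood difference of roughly 1.3.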

Power of Population Assignment using Likelihood
- Assignment success depends on:
  - Number of markers used
  - Polymorphism of markers
  - Number of possible source populations
  - Differentiation of populations
  - Accuracy of allele frequency estimates
- Rules of thumb (Cornuet et al. 1999): for 100% assignment success with 10 reference populations, need:
  - 30 to 50 reference individuals per population
  - 10 microsatellite loci
  - H_E > 0.6
  - F_ST > 0.1

Knowing what you know about human population genetics, is it worth the effort to assign our skin sample to Asian or Sicilian populations?
- Rules of thumb (Cornuet et al. 1999): for 100% assignment success with 10 reference populations, need:
  - 30 to 50 reference individuals per population
  - 10 microsatellite loci
  - H_E > 0.6
  - F_ST > 0.1

Population Assignment Example: Wolf Populations in Northwest Territories
- Wolves sampled from island and mainland populations in the Canadian Northwest Territories
- Immigrants from Banks Island (white circles) detected on the mainland (black circles)
Carmichael et al Mol Ecol 10:2787

Population Assignment Example: Fish Stories
- Fishing competition on Lake Saimaa in Southeast Finland
- Contestant allegedly caught a 5.5 kg salmon, much larger than usual for the lake
- Compared fish from the lake to fish from local markets (originating from Norway and the Baltic Sea)
- 7 microsatellites
- Based on likelihood analysis, the fish was purchased rather than caught in the lake
[Figure labels: Lake Saimaa, Market]