Wi’08Structure Population sub-structure. Wi’08Structure Projects Harish/Nitin Gaurav (Tuesday) Stefano/Hossein (Tuesday) Nisha/Yu David Jian/Josue (Tuesday)

Slides:



Advertisements
Similar presentations
Linkage and Genetic Mapping
Advertisements

Single Nucleotide Polymorphism And Association Studies Stat 115 Dec 12, 2006.
SNP Applications statwww.epfl.ch/davison/teaching/Microarrays/snp.ppt.
Single Nucleotide Polymorphism Copy Number Variations and SNP Array Xiaole Shirley Liu and Jun Liu.
METHODS FOR HAPLOTYPE RECONSTRUCTION
Basics of Linkage Analysis
Understanding GWAS Chip Design – Linkage Disequilibrium and HapMap Peter Castaldi January 29, 2013.
MALD Mapping by Admixture Linkage Disequilibrium.
Plant of the day! Pebble plants, Lithops, dwarf xerophytes Aizoaceae
Genomics An introduction. Aims of genomics I Establishing integrated databases – being far from merely a storage Linking genomic and expressed gene sequences.
CS177 Lecture 9 SNPs and Human Genetic Variation Tom Madej
Biology and Bioinformatics Gabor T. Marth Department of Biology, Boston College BI820 – Seminar in Quantitative and Computational Problems.
A coalescent computational platform for tagging marker selection for clinical studies Gabor T. Marth Department of Biology, Boston College
March 2006Vineet Bafna CSE280b: Population Genetics Vineet Bafna/Pavel Pevzner
CSE182-L18 Population Genetics. Perfect Phylogeny Assume an evolutionary model in which no recombination takes place, only mutation. The evolutionary.
CSE 291: Advanced Topics in Computational Biology Vineet Bafna/Pavel Pevzner
CSE182-L17 Clustering Population Genetics: Basics.
March 2006Vineet Bafna CSE280b: Population Genetics Vineet Bafna/Pavel Pevzner
March 2006Vineet Bafna CSE280b: Population Genetics Vineet Bafna/Pavel Pevzner
SNP Selection University of Louisville Center for Genetics and Molecular Medicine January 10, 2008 Dana Crawford, PhD Vanderbilt University Center for.
Haplotype Discovery and Modeling. Identification of genes Identify the Phenotype MapClone.
Introduction Basic Genetic Mechanisms Eukaryotic Gene Regulation The Human Genome Project Test 1 Genome I - Genes Genome II – Repetitive DNA Genome III.
Population Genetics 101 CSE280Vineet Bafna. Personalized genomics April’08Bafna.
Haplotype Blocks An Overview A. Polanski Department of Statistics Rice University.
The medical relevance of genome variability Gabor T. Marth, D.Sc. Department of Biology, Boston College
Computational research for medical discovery at Boston College Biology Gabor T. Marth Boston College Department of Biology
Linear Reduction for Haplotype Inference Alex Zelikovsky joint work with Jingwu He WABI 2004.
SNPs Daniel Fernandez Alejandro Quiroz Zárate. A SNP is defined as a single base change in a DNA sequence that occurs in a significant proportion (more.
National Taiwan University Department of Computer Science and Information Engineering Haplotype Inference Yao-Ting Huang Kun-Mao Chao.
Non-Mendelian Genetics
E QUILIBRIA IN POPULATIONS CSE280Vineet Bafna Population data Recall that we often study a population in the form of a SNP matrix – Rows.
Biology 101 DNA: elegant simplicity A molecule consisting of two strands that wrap around each other to form a “twisted ladder” shape, with the.
CS177 Lecture 10 SNPs and Human Genetic Variation
1 A Robust Framework for Detecting Structural Variations February 6, 2008 Seunghak Lee 1, Elango Cheran 1, and Michael Brudno 1 1 University of Toronto,
Gene Hunting: Linkage and Association
CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna.
National Taiwan University Department of Computer Science and Information Engineering Pattern Identification in a Haplotype Block * Kun-Mao Chao Department.
Lecture 19: Association Studies II Date: 10/29/02  Finish case-control  TDT  Relative Risk.
Large-scale recombination rate patterns are conserved among human populations David Serre McGill University and Genome Quebec Innovation Center UQAM January.
1 Population Genetics Basics. 2 Terminology review Allele Locus Diploid SNP.
Finnish Genome Center Monday, 16 November Genotyping & Haplotyping.
Copy Number Variation Eleanor Feingold University of Pittsburgh March 2012.
Estimating Recombination Rates. LRH selection test, and recombination Recall that LRH/EHH tests for selection by looking at frequencies of specific haplotypes.
INTRODUCTION TO ASSOCIATION MAPPING
Recombination based population genomics Jaume Bertranpetit Marta Melé Francesc Calafell Asif Javed Laxmi Parida.
Identification of Copy Number Variants using Genome Graphs
Linear Reduction Method for Tag SNPs Selection Jingwu He Alex Zelikovsky.
Association analysis Genetics for Computer Scientists Biomedicum & Department of Computer Science, Helsinki Päivi Onkamo.
February 20, 2002 UD, Newark, DE SNPs, Haplotypes, Alleles.
1 Population Genetics Basics. 2 Terminology review Allele Locus Diploid SNP.
Association mapping for mendelian, and complex disorders January 16Bafna, BfB.
Populations: defining and identifying. Two major paradigms for defining populations Ecological paradigm A group of individuals of the same species that.
Linkage Disequilibrium and Recent Studies of Haplotypes and SNPs
CSE280Vineet Bafna In a ‘stable’ population, the distribution of alleles obeys certain laws – Not really, and the deviations are interesting HW Equilibrium.
Coalescent theory CSE280Vineet Bafna Expectation, and deviance Statements such as the ones below can be made only if we have an underlying model that.
Lecture 11: Linkage Analysis IV Date: 10/01/02  linkage grouping  locus ordering  confidence in locus ordering.
Global Variation in Copy Number in the Human Genome Speaker: Yao-Ting Huang Nature, Genome Research, Genome Research, 2006.
The Haplotype Blocks Problems Wu Ling-Yun
Population stratification
Inferences on human demographic history using computational Population Genetic models Gabor T. Marth Department of Biology Boston College Chestnut Hill,
Single Nucleotide Polymorphisms (SNPs
Common variation, GWAS & PLINK
Population Genetics As we all have an interest in genomic epidemiology we are likely all either in the process of sampling and ananlysising genetic data.
L4: Counting Recombination events
Estimating Recombination Rates
Vineet Bafna/Pavel Pevzner
Haplotype Inference Yao-Ting Huang Kun-Mao Chao.
Haplotype Inference Yao-Ting Huang Kun-Mao Chao.
Outline Cancer Progression Models
Haplotype Inference Yao-Ting Huang Kun-Mao Chao.
Presentation transcript:

Wi’08Structure Population sub-structure

Wi’08Structure Projects Harish/Nitin Gaurav (Tuesday) Stefano/Hossein (Tuesday) Nisha/Yu David Jian/Josue (Tuesday)

Wi’08Structure Population sub-structure confounds association Consider an association test in a recently admixed population Suppose location A increases susceptibility to a disease common in Africans. Let locus B associate with skin-color. As expected, locus A is associated with the disease Unexpectedly, locus B also shows association The problem arises because the assumption of a random mating population is no longer true – The population has structure! D N D N D N D D N A B African European

Wi’08Structure Population sub-structure can increase LD Consider two populations that were isolated and evolving independently. They might have different allele frequencies in some regions. Pick two regions that are far apart (LD is very low, close to 0) Pop. A Pop. B p 1 =0.1 q 1 =0.9 P 11 =0.1 D=0.01 p 1 =0.9 q 1 =0.1 P 11 =0.1 D=0.01

Wi’08Structure Recent ad-mixing of population If the populations came together recently (Ex: African and European population), artificial LD might be created. D = 0.15 (instead of 0.01), increases 10-fold This spurious LD might lead to false associations Other genetic events can cause LD to increase, and one needs to be careful Pop. A+B p 1 =0.5 q 1 =0.5 P 11 =0.1 D= =0.15

Wi’08Structure Determining population sub-structure Given a mix of people, can you sub-divide them into ethnic populations. Turn the ‘problem’ of spurious LD into a clue. – Find markers that are too far apart to show LD – If they do show LD (correlation), that shows the existence of multiple populations. – Sub-divide them into populations so that LD disappears.

Wi’08Structure Determining Population sub-structure Same example as before: The two markers are too similar to show any LD, yet they do show LD. However, if you split them so that all 0..1 are in one population and all 1..0 are in another, LD disappears

Wi’08Structure Iterative algorithm for population sub- structure Define N = number of individuals (each has a single chromosome) k = number of sub-populations. Z  {1..k} N is a vector giving the sub-population. – Z i =k’ => individual i is assigned to population k’ X i,j = allelic value for individual i in position j P k,j,l = frequency of allele l at position j in population k

Wi’08Structure Example Ex: consider the following assignment P 1,1,0 = 0.9 P 2,1,0 = P 1,1,0

Wi’08Structure Goal X is known. P, Z are unknown. The goal is to estimate Pr(P,Z|X) Various learning techniques can be employed. – max P,Z Pr(X|P,Z) (Max likelihood estimate) – max P,Z Pr(X|P,Z) Pr(P,Z) (MAP) – Sample P,Z from Pr(P,Z|X) Here a Bayesian (MCMC) scheme is employed to sample from Pr(P,Z|X). We will only consider a simplified version

Wi’08Structure Algorithm:Structure Iteratively estimate – (Z (0),P (0) ), (Z (1),P (1) ),.., (Z (m),P (m) ) After ‘convergence’, Z (m) is the answer. Iteration – Guess Z (0) – For m = 1,2,.. Sample P (m) from Pr(P | X, Z (m-1) ) Sample Z (m) from Pr(Z | X, P (m) ) How is this sampling done? (Z 1,P 1 ) (Z 3,P 3 )(Z 4,P 4 )(Z 2,P 2 )

Wi’08Structure Example Choose Z at random, so each individual is assigned to be in one of 2 populations. See example. Now, we need to sample P (1) from Pr(P | X, Z (0) ) Simply count N k,j,l = number of people in population k which have allele l in position j p k,j,l = N k,j,l / N

Wi’08Structure Example N k,j,l = number of people in population k which have allele l in position j p k,j,l = N k,j,l / N k,j,* N 1,1,0 = 4 N 1,1,1 = 6 p 1,1,0 = 4/10 p 1,2,0 = 4/10 Thus, we can sample P (m)

Wi’08Structure Sampling Z Pr[Z 1 = 1] = Pr[”01” belongs to population 1]? We know that each position should be in linkage equilibrium and independent. Pr[”01” |Population 1] = p 1,1,0 * p 1,2,1 =(4/10)*(6/10)=(0.24) Pr[”01” |Population 2] = p 2,1,0 * p 2,2,1 = (6/10)*(4/10)=0.24 Pr [Z 1 = 1] = 0.24/( ) = 0.5 Assuming, HWE, and LE

Wi’08Structure Sampling Suppose, during the iteration, there is a bias. Then, in the next step of sampling Z, we will do the right thing Pr[“01”| pop. 1] = p 1,1,0 * p 1,2,1 = 0.7*0.7 = 0.49 Pr[“01”| pop. 2] = p 2,1,0 * p 2,2,1 =0.3*0.3 = 0.09 Pr[Z 1 = 1] = 0.49/( ) = 0.85 Pr[Z 6 = 1] = 0.49/( ) = 0.85 Eventually all “01” will become 1 population, and all “10” will become a second population

Wi’08Structure Allowing for admixture Define q i,k as the fraction of individual i that originated from population k. Iteration – Guess Z (0) – For m = 1,2,.. Sample P (m),Q (m) from Pr(P,Q | X, Z (m-1) ) Sample Z (m) from Pr(Z | X, P (m),Q (m) )

Wi’08Structure Estimating Z (admixture case) Instead of estimating Pr(Z(i)=k|X,P,Q), (origin of individual i is k), we estimate Pr(Z(i,j,l)=k|X,P,Q) i,1 i,2 j

Wi’08Structure Results: Thrush data For each individual, q(i) is plotted as the distance to the opposite side of the triangle. The assignment is reliable, and there is evidence of admixture.

Wi’08Structure NJ versus Structure:thrush data Objective function is different in standard clustering algorithms!

Wi’08Structure Population Structure in isolated human populations 377 locations (loci) were sampled in 1000 people from 52 populations. 6 genetic clusters were obtained, which corresponded to 5 geographic regions (Rosenberg et al. Science 2003) Africa EurasiaEast Asia America Oceania

Wi’08Structure Population sub-structure:research problem Systematically explore the effect of admixture. Can admixture be predicted for a locus, or for an individual The sampling approach may or may not be appropriate. Formulate as an optimization/learning problem: – (w/out admixture). Assign individuals to sub- populations so as to maximize linkage equilibrium, and hardy weinberg equilibrium in each of the sub- populations – (w/ admixture) Assign (individuals, loci) to sub- populations

Wi’08Structure Population Substructure Conclusions Populations substructure (violating the assumption of random mating) can create artificial linkage disequlibrium, and false associations One way out of it is to identify and sub-divide the populations into sub-populations. Another approach is to ‘correct’ for the effects of structure.

Wi’08Structure Detecting structural variations

Wi’08Structure Genomic variation Inherited phenotypes arise from genetic variation – Point mutations, indels, genetic recombination, but also…. – Structural Variation Insertions, deletions, translocations, inversions, fissions, fusions…. deletions inversions translocations

Wi’08Structure Fluoroscent in situ hybridization (Cancer genomes show extensive structural variation) Historically, larger structural variations (easily observed under a microscope were commonly studied, mostly in the context of diseases…

Wi’08Structure HapMap and other projects The development of molecular techniques (sequencing, genotyping) shifted interest onto smaller scale mutations. They were assumed to be the dominant source of genetic variation It turns out that structural variation in normal populations is more common than assumed.

Wi’08Structure 1. Inversions and HapMap A A A A A AC C G G G G T T T G G G AG A A A G A AG GA CT CA SNP data is seemingly oblivious to Inversions (& other structural polymorphisms) Can we detect a signal for inversion polymorphisms in SNPs? A A A A A G G G G G G G G G reference genome sequence population

Wi’08Structure Array-CGH Garnis et al. Molecular Cancer 2004

Wi’08Structure CNVs dominate, possibly because of technology

Wi’08Structure Structural variations Structural variants are common, but non CNVs are hard to detect. Some of the translocations are important in disease

Wi’08Structure 2. Variations in tumor genomes Fusion observed in leukemia, lymphoma, and sarcomas “Philadelphia Translocation” Drugs target this fusion protein Can we detect fused genes in tumors?

Wi’08Structure 3. Assaying for specific variations Most tumors start off with a single cell, which then proliferate. Drugs like Gleevec are used well after cancer has taken hold. Can we detect the cancer early by detecting the genomic abnormality? – If a very few cells in the person are cancerous, can we still detect it? Can we track a patient through his treatment?

Wi’08Structure Today Detection of inversion polymorphisms (& other copy neutral variation) using genotype data. Detection of gene fusion events using sequencing PCR based detection of variation in the presence of heterosomy: only a fraction of cells carry the variant genome

Wi’08Structure Chr. 17 Inversion

Wi’08Structure Inversion polymorphisms

Wi’08Structure Common inverted haplotype

Wi’08Structure LLR statistic computation Multi-allelic markers are chosen in blocks around the putative breakpoints LD is measured using multi-allelic D-statistic M1M1 M2M2 M3M3 M4M4 LD 24 LD 34 LD 13 LD 12 Bansal et al. Genome Research, 2007

Wi’08Structure LLR l M1M1 M2M2 M3M3 M4M4 LD 13 d 12 d 13

Wi’08Structure LLR R M1M1 M2M2 M3M3 M4M4 LD 24 LD 34 d 12 d 24

Wi’08Structure Inversion polymorphisms in human LLRl and LLRr define our inversion statistic Large positive values for both likelihood ratios indicate inversion A permutation test to estimate the significance of the statistic Overlapping breakpoints are merged

Wi’08Structure Power Difficult to simulate inversions using the coalescent We simulated inversions on HapMap data. (a) varying frequency, 500kb inversion (b) varying length

Wi’08Structure Inversion polymorphisms We used the Phase I phased haplotype data from the International HapMap project – 3 samples: CEU, YRI, Asian: CHB+JPT about 900K SNPs The location between two adjacent SNPs was considered a potential inversion breakpoint The statistic was computed for every pair of inversion breakpoints within a certain distance (200kb - 6Mb) 176 inversion polymorphisms were detected

Wi’08Structure Inversions with secondary evidence 18 regions had highly homologous repeat sequence, 11 with inverted repeats (p-value against random distribution of inversion breakpoints)

Wi’08Structure Inversions with secondary evidence

Wi’08Structure 1.4Mb inversion at 16p12(CEU+YRI) Supported by fosmid pair evidence (Tuzun et al., 2005) 80kb long inverted repeat Identical breakpoints in CEU and YRI data Known chimp repeat

Wi’08Structure 1.2Mb Chr 10 (CHB+JPT) Inversion supported by fosmid evidence (Tuzun et al. 2005)

Wi’08Structure Chr 6 Inversion(YRI) & TCBA1 66 inversion breakpoints covered by genes Less than expected by chance (p-value 0.02) Many overlapping genes are known to be disrupted in disease T-cell lymphoma breakpoint associated target (TCBA1) – Structurally disrupted in T-cell lymphoma cell-lines

Wi’08Structure ICAp69 Islet cell antigen (ICAp69) gene

Wi’08Structure Inversion Polymorphism summary Many inversions (and copy neutral) polymorphisms remain to be detected. The genotype LD signal is useful but has low power. – We are investigating other signals to improve detection Some of these variations lead to genes fusing.

Wi’08Structure Conclusions for the class Genetic variations are important to our understanding of evolution and genotype- phenotype relationship

Wi’08Structure Topics covered 1. We started (and finished) by considering sources of variation 2. Models of evolution of population under natural assumptions 1. HW/Linkage equlibrium 2. Efficient simulation of populations via coalescent theory 3. Evolution under recombinations/recombination hot- spot detection via counting of recombination events 4. Detection of regions under selection 5. Association testing 6. Population sub-structure 7. Detecting structural variation

Wi’08Structure Algorithmic diversions 1. Phylogeny reconstruction (perfect phylogeny, distance-based methods) 2. Optimization using (Integer-) Linear programming, simulated annealing and other paradigms 3. Greedy algorithms, dynamic programming 4. Stochastic sampling methods (MCMC, Gibbs sampling) 5. Statistical tests for significance 6. Efficient simulation techniques 1. Coalescent for populations 2. Simulating genotype/phenotype associations