Statistical Methods for Quantitative Trait Loci (QTL) Mapping II


Statistical Methods for Quantitative Trait Loci (QTL) Mapping II Lectures 5 – Oct 12, 2011 CSE 527 Computational Biology, Fall 2011 Instructor: Su-In Lee TA: Christopher Miles Monday & Wednesday 12:00-1:20 Johnson Hall (JHN) 022

Course announcements. HW #1 is out. Project proposal: due next Wed; 1 paragraph describing what you’d like to work on for the class project.

Why are we so different? Human genetic diversity. Different “phenotype” (any observable characteristic or trait): appearance, disease susceptibility, drug responses, … Different “genotype”: individual-specific DNA, a 3-billion-long string, for example TGATCGAAGCTAAATGCATCAGCTGATGATCCTAGC…, TGATCGTAGCTAAATGCATCAGCTGATGATCGTAGC… and TGATCGCAGCTAAATGCAGCAGCTGATGATCGTAGC…

Motivation: Which sequence variation affects a trait (appearance, personality, disease susceptibility, drug responses, …)? Answering this gives a better understanding of disease mechanisms and supports personalized medicine. DNA (3 billion long, e.g. ACTTCGGAACATATCAAATCCAACGC…) is the instruction given to each cell; sequence variations give a different person a different instruction. [Figure: a person and a different person, annotated with example values: Obese? 15%; Bold? 30%; Diabetes? 6.2%; Parkinson’s disease? 0.3%; Heart disease? 20.1%; Colon cancer? 6.5%.]

QTL mapping. Data: Phenotypes: yi = trait value for mouse i. Genotypes: xik = 1/0 (i.e. AB/AA) of mouse i at marker k. Genetic map: locations of the genetic markers. Goal: identify the genomic regions (QTLs) contributing to variation in the phenotype. [Figure: genotype data as a matrix of 0/1 entries, one row per mouse individual and one column per marker (3,000 markers), e.g. 0101100100…011, alongside a column of phenotype data.]

Outline: Statistical methods for mapping QTL. What is QTL? Experimental animals. Analysis of variance (marker regression). Interval mapping (EM).

Interval mapping [Lander and Botstein, 1989]. Consider any one position in the genome as the location for a putative QTL. For a particular mouse, let z = 1/0 if the (unobserved) genotype at the QTL is AB/AA. Calculate P(z = 1 | marker data); only the nearby genotyped markers need to be considered, and the calculation may allow for the presence of genotyping errors. Given the genotype at the QTL, the phenotype is distributed as N(µ + ∆z, σ^2). Given the marker data, the phenotype follows a mixture of normal distributions.
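The mixture statement can be written out explicitly. As a sketch, write p_i = P(z_i = 1 | marker data) for mouse i and φ for the normal density; both symbols are notation introduced here and are not on the slide:

```latex
f(y_i \mid \text{marker data})
  = (1 - p_i)\,\phi\!\left(y_i;\ \mu,\ \sigma^2\right)
  + p_i\,\phi\!\left(y_i;\ \mu + \Delta,\ \sigma^2\right)
```

Here µ is the phenotype mean for QTL genotype AA (z = 0) and µ + ∆ is the mean for AB (z = 1).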

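IM: the mixture model. [Figure: nearest flanking markers M1 and M2 at positions 0 and 20, with the putative QTL at position 7. Depending on the genotypes at M1/M2, the QTL genotype probabilities shown are: 99% AB; 65% AB and 35% AA; 35% AB and 65% AA; 99% AA.] Let’s say that the mice with QTL genotype AA have average phenotype µA while the mice with QTL genotype AB have average phenotype µB. The QTL has effect ∆ = µB - µA. What are the unknowns? µA and µB, and the genotype at the QTL.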
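The percentages in the figure can be reproduced with a short calculation. The sketch below makes assumptions that the slide does not state: the positions 0, 7 and 20 are in cM, the Haldane map function converts distance to recombination fraction (no crossover interference), and there are no genotyping errors. The function names are illustrative.

```python
import math

def haldane_recomb_fraction(distance_cm):
    """Recombination fraction for a map distance in cM (Haldane map function; an assumption)."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))

def prob_qtl_ab(m1, m2, pos_m1, pos_qtl, pos_m2):
    """P(z = 1 | flanking markers) in a backcross; m1 and m2 are 1 for AB and 0 for AA."""
    r1 = haldane_recomb_fraction(pos_qtl - pos_m1)
    r2 = haldane_recomb_fraction(pos_m2 - pos_qtl)

    def joint(z):
        # P(m1, z, m2) = P(z) * P(m1 | z) * P(m2 | z), with P(z) = 1/2 in a backcross
        p1 = (1.0 - r1) if m1 == z else r1
        p2 = (1.0 - r2) if m2 == z else r2
        return 0.5 * p1 * p2

    return joint(1) / (joint(0) + joint(1))

# The four possible flanking-marker configurations (M1, M2)
for m1, m2 in [(1, 1), (1, 0), (0, 1), (0, 0)]:
    print(m1, m2, round(prob_qtl_ab(m1, m2, 0.0, 7.0, 20.0), 2))
```

Under these assumptions the configurations AB/AB, AB/AA, AA/AB and AA/AA give P(AB) of about 0.99, 0.65, 0.35 and 0.01, in line with the percentages in the figure.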

IM: estimation and LOD scores. Use a version of the EM algorithm (an iterative algorithm) to obtain estimates of µA, µB and σ, together with the expectation of z. Calculate the LOD score. Repeat for all other genomic positions (in practice, at 0.5 cM steps along the genome).
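A minimal sketch of this step at a single genomic position is given below. It takes the LOD score to be the usual log10 likelihood ratio of the QTL model against the no-QTL model (one normal distribution for all mice); that is the standard definition rather than something spelled out in the transcript. The function names and the choice to start EM from the marker-based probabilities are assumptions.

```python
import math

def normal_pdf(y, mu, sigma):
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def em_interval_mapping(y, p, n_iter=100):
    """y: phenotypes; p[i] = P(z_i = 1 | marker data) at the putative QTL.
    Returns (mu_a, mu_b, sigma, lod)."""
    n = len(y)
    w = list(p)  # initial E(z_i): the marker-based probabilities
    for _ in range(n_iter):
        # M-step: weighted group means and a pooled standard deviation
        sw = sum(w)
        mu_b = sum(wi * yi for wi, yi in zip(w, y)) / sw
        mu_a = sum((1.0 - wi) * yi for wi, yi in zip(w, y)) / (n - sw)
        sigma = math.sqrt(sum(wi * (yi - mu_b) ** 2 + (1.0 - wi) * (yi - mu_a) ** 2
                              for wi, yi in zip(w, y)) / n)
        # E-step: posterior P(z_i = 1 | y_i, marker data)
        w = []
        for pz, yi in zip(p, y):
            fb = pz * normal_pdf(yi, mu_b, sigma)
            fa = (1.0 - pz) * normal_pdf(yi, mu_a, sigma)
            w.append(fb / (fa + fb))
    # log10 likelihood of the mixture (QTL) model
    ll1 = sum(math.log10(pz * normal_pdf(yi, mu_b, sigma)
                         + (1.0 - pz) * normal_pdf(yi, mu_a, sigma))
              for pz, yi in zip(p, y))
    # log10 likelihood of the no-QTL model: one normal for all mice
    mu0 = sum(y) / n
    sigma0 = math.sqrt(sum((yi - mu0) ** 2 for yi in y) / n)
    ll0 = sum(math.log10(normal_pdf(yi, mu0, sigma0)) for yi in y)
    return mu_a, mu_b, sigma, ll1 - ll0
```

A genome scan would call this once per position, each time with the P(z = 1 | marker data) values computed for that position. The sketch does not guard against degenerate inputs such as all probabilities equal to 0 or 1.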

A simulated example Genetic markers LOD score curves

Interval mapping. Advantages: takes proper account of missing data; can allow for the presence of genotyping errors; pretty pictures; high power in low-density scans; improved estimate of QTL location. Disadvantages: greater computational effort (doing EM for each position); requires specialized software; more difficult to include covariates; only considers one QTL at a time.

Statistical significance. Large LOD score → evidence for QTL. Question: How large is large? Answer 1: Consider the distribution of the LOD score if there were no QTL. Answer 2: Consider the distribution of the maximum LOD score. Null hypothesis: there are no QTLs segregating in the population. [Figure: null distribution of the LOD score at a particular genomic position (solid curve) and of the maximum LOD score from a genome scan (dashed curve).] Under the null, there is only a ~3% chance that a given genomic position gets a LOD score ≥ 1.

LOD thresholds. To account for the genome-wide search, compare the observed LOD scores to the null distribution of the maximum LOD score, genome-wide, that would be obtained if there were no QTL anywhere. LOD threshold = 95th percentile of the distribution of the genome-wide max LOD, when there are no QTL anywhere. Methods for obtaining thresholds: analytical calculations (assuming a dense map of markers) (Lander & Botstein, 1989); computer simulations; permutation (randomization) test (Churchill & Doerge, 1994).
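The permutation approach can be sketched as follows. To keep the example short, the scan statistic is the single-marker regression LOD, (n/2) log10(RSS0/RSS1), rather than the interval-mapping LOD. The function names, the data layout (one 0/1 list per marker) and the default of 1000 permutations are illustrative choices.

```python
import math
import random

def marker_lod(y, g):
    """LOD for one backcross marker: (n/2) * log10(RSS0 / RSS1)."""
    n = len(y)
    mean = sum(y) / n
    rss0 = sum((v - mean) ** 2 for v in y)       # no-QTL model: one mean
    rss1 = 0.0                                    # QTL model: one mean per genotype
    for code in (0, 1):
        vals = [yi for yi, gi in zip(y, g) if gi == code]
        if vals:
            m = sum(vals) / len(vals)
            rss1 += sum((v - m) ** 2 for v in vals)
    return 0.5 * n * math.log10(rss0 / rss1)

def permutation_threshold(y, genotypes, n_perm=1000, alpha=0.05, seed=0):
    """(1 - alpha) quantile of the genome-wide maximum LOD under permutation.
    genotypes: list of markers, each a list of 0/1 codes (one per mouse)."""
    rng = random.Random(seed)
    y_perm = list(y)
    max_lods = []
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # break the genotype-phenotype link, keep marker structure
        max_lods.append(max(marker_lod(y_perm, g) for g in genotypes))
    max_lods.sort()
    idx = min(n_perm - 1, math.ceil((1.0 - alpha) * n_perm) - 1)
    return max_lods[idx]
```

An observed LOD peak would then be declared significant at the genome-wide 5% level if it exceeds `permutation_threshold(y, genotypes)`. Shuffling phenotypes rather than genotypes preserves the correlation between linked markers.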

More on LOD thresholds. Appropriate threshold depends on: size of genome; number of typed markers; pattern of missing data; stringency of significance threshold; type of cross (e.g. F2 intercross vs backcross); etc.

An example Permutation distribution for a trait

Modeling multiple QTLs. Motivation: trait variation that is not explained by a detected putative QTL. Advantages: reduce the residual variation and obtain greater power to detect additional QTLs; identification of (epistatic) interactions between QTLs requires the joint modeling of multiple QTLs. Interactions between two loci: without interaction (additive), the effect of QTL 1 is the same irrespective of the genotype at QTL 2, and vice versa; with interaction (epistasis), the effect of QTL 1 depends on the genotype at QTL 2, and vice versa.

Multiple marker model. Let y = phenotype, x = genotype data. Imagine a small number of QTL with genotypes x1,…,xp: there are 2^p or 3^p distinct genotypes for a backcross and an intercross, respectively. We assume that E(y|x) = µ(x1,…,xp) and var(y|x) = σ^2(x1,…,xp).

Multiple marker model. Constant variance: σ^2(x1,…,xp) = σ^2. Assuming normality: y|x ~ N(µg, σ^2). Additivity: µ(x1,…,xp) = µ + ∑j ∆j xj. Epistasis: µ(x1,…,xp) = µ + ∑j ∆j xj + ∑j,k wj,k xj xk.

Computational problem. N backcross individuals and M markers in all, with at most a handful expected to be near a QTL. xij = genotype (0/1) of mouse i at marker j; yi = phenotype (trait value) of mouse i. Assuming additivity, yi = µ + ∑j ∆j xij + ei. Which ∆j ≠ 0? This is variable selection in linear regression models.

Mapping QTL as model selection. Select the class of models: additive models; additive with pairwise interactions; regression trees; … [Figure: genotypes x1, x2, …, xN feed into the phenotype y with weights w1, w2, …, wN.] Phenotype: y = w1 x1 + … + wN xN + ε. Fit: minimize_w (w1 x1 + … + wN xN - y)^2 ?

Linear regression. minimize_w (w1 x1 + … + wN xN - y)^2 + model complexity. Search the model space: forward selection (FS); backward elimination (BE); FS followed by BE; … [Figure: genotypes x1, x2, …, xN with parameters w1, w2, …, wN; phenotype y = w1 x1 + … + wN xN + ε.]
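Forward selection can be sketched in a few lines. The slide leaves the "model complexity" term unspecified; the sketch below uses a BIC-style penalty, n ln(RSS/n) + k ln(n), purely as an assumption. Each accepted marker is regressed out of the residual and of the remaining marker columns, so the gain from adding a marker is the exact drop in residual sum of squares.

```python
import math

def forward_selection(X, y):
    """Greedy forward selection of markers for y = mu + sum_j delta_j x_j + e.
    X: list of rows (one per mouse) of 0/1 genotypes. Returns selected column indices
    in the order they were added. Stopping rule: BIC-style score (an assumption)."""
    n, p = len(X), len(X[0])

    def center(v):
        m = sum(v) / len(v)
        return [a - m for a in v]

    resid = center(y)                                  # residual after fitting the intercept
    cols = [center([X[i][j] for i in range(n)]) for j in range(p)]
    selected = []
    rss = sum(a * a for a in resid)
    score = n * math.log(rss / n) + math.log(n)        # intercept-only model
    while len(selected) < p:
        best = None
        for j in range(p):
            if j in selected:
                continue
            cc = sum(a * a for a in cols[j])
            if cc < 1e-12:                             # column already explained by selected ones
                continue
            cr = sum(a * b for a, b in zip(cols[j], resid))
            new_rss = rss - cr * cr / cc               # RSS after adding marker j
            if best is None or new_rss < best[1]:
                best = (j, new_rss)
        if best is None:
            break
        j, new_rss = best
        new_score = n * math.log(new_rss / n) + (len(selected) + 2) * math.log(n)
        if new_score >= score:                         # penalty outweighs the improvement
            break
        q = cols[j]
        qq = sum(a * a for a in q)
        b_r = sum(a * b for a, b in zip(q, resid)) / qq
        resid = [a - b_r * b for a, b in zip(resid, q)]
        for k in range(p):                             # orthogonalise remaining candidates
            if k != j and k not in selected:
                b_k = sum(a * b for a, b in zip(q, cols[k])) / qq
                cols[k] = [a - b_k * b for a, b in zip(cols[k], q)]
        selected.append(j)
        rss, score = new_rss, new_score
    return selected
```

Backward elimination would run the same comparison in reverse, starting from the full model and dropping the marker whose removal improves the score most.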

Lasso* (L1) regression. minimize_w (w1 x1 + … + wN xN - y)^2 + C ∑i |wi|, where C ∑i |wi| is the L1 term. Induces sparsity in the solution w (many wi’s set to zero). Provably selects the “right” features when many features are irrelevant. Convex optimization problem: no combinatorial search; unique global optimum; efficient optimization. [Figure: comparison of the L2 and L1 penalties over the parameters w1, w2.] * Tibshirani, 1996.
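One common way to solve this problem is cyclic coordinate descent with soft-thresholding, sketched below; the slide does not say which solver the lecture had in mind, so this is an assumed choice. The objective is taken to be the squared error summed over mice plus C ∑j |wj|, with an unpenalised intercept handled by centering. The function names are illustrative.

```python
def soft_threshold(a, t):
    """Shrink a towards zero by t; values within [-t, t] become exactly zero."""
    if a > t:
        return a - t
    if a < -t:
        return a + t
    return 0.0

def lasso_coordinate_descent(X, y, C, n_sweeps=200):
    """Minimise sum_i (y_i - mu - sum_j w_j x_ij)^2 + C * sum_j |w_j| over w.
    X: list of rows (one per mouse). Returns the weight vector w."""
    n, p = len(X), len(X[0])
    y_mean = sum(y) / n
    col_means = [sum(X[i][j] for i in range(n)) / n for j in range(p)]
    Xc = [[X[i][j] - col_means[j] for j in range(p)] for i in range(n)]
    resid = [yi - y_mean for yi in y]          # residual for the current w (all zeros)
    w = [0.0] * p
    z = [sum(Xc[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if z[j] == 0.0:
                continue
            # correlation of column j with the residual that excludes its own contribution
            rho = sum(Xc[i][j] * resid[i] for i in range(n)) + z[j] * w[j]
            new_w = soft_threshold(rho, C / 2.0) / z[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * Xc[i][j]
                w[j] = new_w
    return w
```

Larger values of C set more weights exactly to zero; the markers with non-zero weight are the candidate QTL locations. The surviving weights are shrunk towards zero, so they underestimate the effect sizes.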

Model selection. Compare models: likelihood function + model complexity (e.g. # QTLs); cross-validation test; sequential permutation tests. Assess performance: maximize the number of QTL found; control the false positive rate.

Outline. Basic concepts: haplotype, haplotype frequency; recombination rate; linkage disequilibrium. Haplotype reconstruction: parsimony-based approach; EM-based approach.

Review: genetic variation. Single nucleotide polymorphism (SNP): each variant is called an allele; each allele has a frequency. Hardy-Weinberg equilibrium (HWE): relationship between allele and genotype frequencies. How about the relationship between alleles of neighboring SNPs? We need to know about linkage (dis)equilibrium.
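As a reminder of the HWE relationship, a biallelic SNP with allele frequencies p and q = 1 - p has expected genotype frequencies p^2, 2pq and q^2. A minimal sketch (the function name is illustrative):

```python
def hwe_genotype_freqs(p):
    """Expected genotype frequencies (AA, Aa, aa) under HWE for allele A with frequency p."""
    q = 1.0 - p
    return p * p, 2.0 * p * q, q * q

f_aa, f_ab, f_bb = hwe_genotype_freqs(0.3)
print(f_aa, f_ab, f_bb)
```

The three frequencies always sum to 1, since (p + q)^2 = 1.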

Let’s consider the history of two neighboring alleles…

History of two neighboring alleles. Alleles that exist today arose through ancient mutation events… [Figure: before mutation, the site carries A; after mutation, both A and C are present.]

History of two neighboring alleles. One allele arose first, and then the other… [Figure: before mutation, haplotypes A G and C G; after mutation at the second site, haplotypes A G, C G and C C.] Haplotype: combination of alleles present in a chromosome.

Recombination can create more haplotypes. [Figure: with no recombination (or 2n recombination events), chromosomes A G and C C are passed on unchanged; with recombination, they give A C and C G.]

[Figure: without recombination, the haplotypes are A G, C G and C C; with recombination, the haplotypes are A G, C G, C C and the recombinant haplotype A C.]

Haplotype. A combination of alleles present in a chromosome. Each haplotype has a frequency, which is the proportion of chromosomes of that type in the population. Consider N binary SNPs in a genomic region: there are 2^N possible haplotypes, but in fact far fewer are seen in human populations.
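The definition of haplotype frequency translates directly into a count. In the sketch below the sample of chromosomes is hypothetical, chosen to reuse the A G, C G and C C haplotypes from the earlier slides, and assumes phase is already known.

```python
from collections import Counter

def haplotype_frequencies(haplotypes):
    """Proportion of chromosomes carrying each haplotype (phase assumed known)."""
    counts = Counter(haplotypes)
    total = len(haplotypes)
    return {h: c / total for h, c in counts.items()}

# Hypothetical sample of 8 chromosomes typed at 2 binary SNPs
sample = ["AG", "AG", "CG", "CC", "CC", "CC", "AG", "CG"]
freqs = haplotype_frequencies(sample)
n_snps = len(sample[0])
print(freqs)
print(len(freqs), "haplotypes observed out of", 2 ** n_snps, "possible")
```

In this toy sample only 3 of the 2^2 = 4 possible haplotypes occur, because the recombinant A C is absent.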

More on haplotypes. What determines haplotype frequencies? Recombination rate (r) between neighboring alleles: depends on the population; r is different for different regions of the genome. Linkage disequilibrium (LD): non-random association of alleles at two or more loci, not necessarily on the same chromosome. Why do we care about haplotypes or LD?
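Non-random association between two SNPs is commonly quantified with D = p_AB - p_A p_B and r^2 = D^2 / (p_A (1 - p_A) p_B (1 - p_B)). These are standard measures, though the slide does not name them. A sketch with illustrative names and made-up frequencies:

```python
def ld_stats(hap_freqs, allele1, allele2):
    """D and r^2 between two SNPs from 2-SNP haplotype frequencies.
    hap_freqs maps strings such as 'AG' (allele at SNP 1, allele at SNP 2) to frequencies."""
    p_a = sum(f for h, f in hap_freqs.items() if h[0] == allele1)
    p_b = sum(f for h, f in hap_freqs.items() if h[1] == allele2)
    p_ab = hap_freqs.get(allele1 + allele2, 0.0)
    d = p_ab - p_a * p_b                       # zero under linkage equilibrium
    denom = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    r2 = d * d / denom if denom > 0.0 else float("nan")
    return d, r2

# Complete LD: allele A always travels with G, and C with C
print(ld_stats({"AG": 0.5, "CC": 0.5}, "A", "G"))
# Linkage equilibrium: all four haplotypes equally frequent
print(ld_stats({"AG": 0.25, "AC": 0.25, "CG": 0.25, "CC": 0.25}, "A", "G"))
```

The first call gives D = 0.25 and r^2 = 1, the second gives D = 0 and r^2 = 0. High r^2 means one SNP largely predicts the other.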

References
Lecture notes of Prof. Gonçalo Abecasis (Univ. of Michigan).
Broman, K.W. Review of statistical methods for QTL mapping in experimental crosses.
Doerge, R.W., et al. Statistical issues in the search for genes affecting quantitative traits in experimental populations. Stat. Sci. 12:195-219, 1997.
Lynch, M. and Walsh, B. Genetics and analysis of quantitative traits. Sinauer Associates, Sunderland, MA, pp. 431-89, 1998.
Broman, K.W. and Speed, T.P. A review of methods for identifying QTLs in experimental crosses, 1999.