Association Modeling With iPlant
Goals of this Section
- Become familiar with the basic concepts of quantitative genetics: traits, phenotypes, genotypes
- Understand the basics of trait mapping
- Understand the conceptual foundations of association studies
- Learn how to perform a genome-wide association study in the iPlant Discovery Environment
  - Obtain genotypes
  - Run a Mixed Linear Model
Phenotype
- Observable (measurable) trait (character) of an organism
- Trait: eye color
- Phenotype: wild type (red), white eyed, orange eyed
http://www.unc.edu/depts/our/hhmi/hhmi-ft_learning_modules/fruitflymodule/phenotypes.html
Qualitative Traits Campbell, 8e
Controlled by One Locus
Co-segregation in Pedigree Donahue, R. P., et al., Probable assignment of the Duffy blood group locus to chromosome 1 in man, Proceedings of the National Academy of Sciences 61, 949-955 (1968).
Quantitative Trait Carlos Harjes
Trait Varies on a Continuous Scale
[Figure: distribution of trait values; axes: Trait Value, Frequency]
Quantitative Traits
Probably caused by:
- Multiple loci
- Interaction effects
- Environment
If the mean trait value of individuals with marker state MM differs from the mean trait value of individuals with marker state mm (i.e. the marker is associated with the phenotype), then the marker is linked to a quantitative trait locus.
[Figure: individuals, their trait values, and their marker scores]

Marker      State     Mean Trait Value
Marker #6   Present   110 ± 10
Marker #6   Absent    115 ± 13
Marker #3   Present    99 ± 5
Marker #3   Absent    118 ± 8
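The comparison of mean trait values between marker states can be sketched in a few lines of Python. This is a minimal illustration, not the method used by any DE app: the toy data are hypothetical (chosen so that the group means come out at 99 and 118, like Marker #3), and Welch's t statistic is used here as one common way to compare two group means.

```python
from math import sqrt
from statistics import mean, variance


def split_by_marker(trait_values, marker_present):
    """Split trait values into marker-present and marker-absent groups."""
    present = [t for t, m in zip(trait_values, marker_present) if m]
    absent = [t for t, m in zip(trait_values, marker_present) if not m]
    return present, absent


def welch_t(group_a, group_b):
    """Welch's t statistic for the difference between two group means."""
    standard_error = sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_a) - mean(group_b)) / standard_error


# Hypothetical toy data: 10 individuals scored for one marker (1 = present).
traits = [98, 101, 104, 95, 97, 117, 120, 112, 125, 116]
marker = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

present, absent = split_by_marker(traits, marker)
t_stat = welch_t(present, absent)
print(mean(present), mean(absent), round(t_stat, 2))
```

A large absolute t statistic indicates that the two marker classes differ in mean trait value; turning it into a p-value additionally requires the t distribution, which is left out of this sketch.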
Quantitative Genetics
Exploring the Genetic Architecture* Underlying Quantitative Traits
*Genetic Architecture:
- How many loci?
- Which location?
- How strong?
Tools for Statistical Genetics in the DE

Tool                              Purpose
Genotype by Sequencing Workflow   Automatic pipeline for extracting SNPs from GBS data (with genome from user or from iPlant database)
UNEAK pipeline                    Automatic pipeline for extracting SNPs from GBS data without reference genomes
MLM workflow                      Automatic workflow for fitting Mixed Linear Model
GLM workflow                      Automatic workflow for fitting General Linear Model
QTLC workflow                     Automatic workflow for composite interval mapping
QTL simulation workflow           Automatic workflow for simulating trait data with given linkage map
PLINK                             PLINK implementation of various association models
Zmapqtl                           Interval mapping and composite interval mapping with the option to perform a permutation test
LRmapqtl                          Linear regression modeling
SRmapqtl                          Stepwise regression modeling
AntEpiSeeker                      Epistatic interaction modeling
Random Jungle                     Random Forest implementation for GWAS
FaST-LMM                          Factored Spectrally Transformed Linear Mixed Modeling
Qxpak                             Versatile mixed modeling
gluH2P                            Convert Hapmap format to Ped format
LD                                Linkage Disequilibrium plot
Structure                         Estimation of population structure
PGDSpider                         Data conversion tool
GLMstructure                      GLM with population structure as fixed effect
A Model for Quantitative Traits
[Diagram: Phenotype, Genotype, Environment]
P = G + E + GG + GE
P = G + e
  P  = Phenotype
  G  = Genotype
  E  = Environment
  GG = Interaction between genotypes
  GE = Interaction between genotype and environment
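To see how the simplified model P = G + e produces a continuously varying trait, the following sketch simulates phenotypes from several additive loci plus random noise. It is an illustration under stated assumptions only: the function name, the additive 0/1/2 allele-count coding, the allele frequency of 0.5, and the normally distributed e are all choices made for this example, not part of the slides.

```python
import random


def simulate_phenotypes(n_individuals, locus_effects, env_sd, seed=0):
    """Simulate P = G + e for diploid individuals.

    G is the sum of additive locus effects times the allele count (0, 1, or 2);
    e is drawn from a normal distribution with standard deviation env_sd.
    """
    rng = random.Random(seed)
    phenotypes = []
    for _ in range(n_individuals):
        # Each locus: two allele draws, each carrying the effect allele with p = 0.5.
        g = sum(
            effect * (rng.randint(0, 1) + rng.randint(0, 1))
            for effect in locus_effects
        )
        e = rng.gauss(0.0, env_sd)
        phenotypes.append(g + e)
    return phenotypes


# Hypothetical example: five loci of equal effect plus environmental noise.
values = simulate_phenotypes(200, [1.0] * 5, env_sd=1.0)
print(min(values), max(values))
```

With a single locus and env_sd set to 0 the simulated trait falls into three discrete classes; adding loci and noise smooths it toward a continuous distribution.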
A Statistical Model for QTLs
P = G + e
yij = β0 + β1 xi + εij
  yij  trait value in individual j with genotype i
  β0   population average of trait value
  β1   effect of marker i on trait value
  xi   marker genotype i
  εij  error term
General Linear Model (in matrix notation): Y = Xb + e
Note: If errors are not normally distributed, use generalized linear models
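For a single marker, the model above reduces to a simple linear regression of trait value on marker genotype. The sketch below estimates β0 and β1 by ordinary least squares; the function name and the 0/1/2 genotype coding are assumptions made for illustration. Note that β0 is the intercept here, which equals the population average only when the genotype codes are centered on zero.

```python
def fit_single_marker(genotypes, traits):
    """Ordinary least-squares estimates (beta0, beta1) for y = beta0 + beta1 * x."""
    n = len(genotypes)
    x_bar = sum(genotypes) / n
    y_bar = sum(traits) / n
    # Sum of squares of x and cross-products of x and y around their means.
    sxx = sum((x - x_bar) ** 2 for x in genotypes)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(genotypes, traits))
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1


# Hypothetical noise-free data generated as y = 10 + 3x.
genotypes = [0, 1, 2, 0, 1, 2]
traits = [10, 13, 16, 10, 13, 16]
beta0, beta1 = fit_single_marker(genotypes, traits)
print(beta0, beta1)
```

A monomorphic marker (all genotypes identical) makes sxx zero and cannot be fitted, which is one reason such markers are filtered out before analysis.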
http://concord.org/publications/newsletter/2009-spring/genetics
Linkage Mapping (QTL Mapping)
Designed population:
- F2
- Recombinant inbred (RIL)
- Doubled haploid (DH)
- Back-cross (B2)
Limitations of Linkage Mapping
- Needs a large number of related individuals
- Resolution is limited (an interval contains 100s of genes)
- QTL position and effect are confounded
Association Mapping
- Use a random collection of individuals from a natural population
- Very dense marker map = very high resolution
Linkage & Recombination
- Recombination causes linkage decay
- Other factors affecting LD:
  - Selection (artificial or natural)
  - Drift
  - Mutations
  - Population structure
  - Demography
Linkage Disequilibrium
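Linkage disequilibrium between two biallelic loci is commonly summarized by D and r². The sketch below computes both from phased haplotypes and shows the textbook decay of D under recombination, D_t = D_0 (1 - c)^t, which assumes random mating and no other forces. Function names and the 0/1 allele coding are assumptions made for this example.

```python
def ld_stats(haplotypes):
    """Return (D, r2) for two biallelic loci.

    haplotypes: list of (allele_at_locus_A, allele_at_locus_B) pairs coded 0/1.
    Both loci must be polymorphic, otherwise r2 is undefined (division by zero).
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r2


def ld_after_generations(d0, recombination_rate, generations):
    """Expected D after t generations of random mating: D0 * (1 - c)^t."""
    return d0 * (1 - recombination_rate) ** generations


# Hypothetical haplotype samples.
complete_ld = [(1, 1), (1, 1), (0, 0), (0, 0)]
no_ld = [(1, 1), (1, 0), (0, 1), (0, 0)]
print(ld_stats(complete_ld), ld_stats(no_ld))
print(ld_after_generations(0.25, 0.1, 10))
```

Tightly linked markers (small c) keep their LD for many generations, which is why dense marker maps are needed to exploit the short LD blocks of natural populations.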
Pitfalls: Population Structure
- Difference in allele frequencies between subpopulations
- Due to neutral or adaptive processes
- Can create spurious association
No association within groups
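A small worked example makes the pitfall concrete. In the hypothetical data below, the marker has no effect within either subpopulation, but the subpopulations differ both in marker frequency and in mean trait value, so pooling them produces an apparent association. All numbers are invented for illustration.

```python
from statistics import mean


def mean_difference(records):
    """Mean trait of marker-present minus marker-absent individuals.

    records: list of (has_marker, trait_value) pairs.
    """
    present = [trait for has_marker, trait in records if has_marker]
    absent = [trait for has_marker, trait in records if not has_marker]
    return mean(present) - mean(absent)


# Subpopulation A: marker common (6 of 8), trait mean 100 regardless of marker.
subpop_a = [(1, 98), (1, 100), (1, 102), (1, 98), (1, 100), (1, 102),
            (0, 99), (0, 101)]
# Subpopulation B: marker rare (2 of 8), trait mean 120 regardless of marker.
subpop_b = [(1, 119), (1, 121),
            (0, 118), (0, 120), (0, 122), (0, 118), (0, 120), (0, 122)]

within_a = mean_difference(subpop_a)
within_b = mean_difference(subpop_b)
pooled = mean_difference(subpop_a + subpop_b)
print(within_a, within_b, pooled)
```

The pooled difference of 10 trait units arises entirely from subpopulation membership, which is the effect that population structure covariates are meant to absorb.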
Pitfalls: Related Individuals
- Similar effect due to the presence of related individuals (esp. in plants)
- Can be accounted for using the data:
  - Estimate number of subpopulations
  - Assign individuals to subpopulations
  - Estimate kinship
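To give a feel for what "estimating kinship from the data" means, the sketch below computes a simple identity-by-state (IBS) similarity matrix from marker genotypes. This is a deliberately simple stand-in and not the estimator implemented in TASSEL; the function name and the 0/1/2 allele-count coding are assumptions for this example.

```python
def ibs_similarity(genotypes):
    """Pairwise identity-by-state similarity between individuals.

    genotypes: list of per-individual lists of allele counts (0, 1, or 2),
    one entry per marker. Similarity is 1 for identical genotypes and 0 for
    individuals that are opposite homozygotes at every marker.
    """
    n = len(genotypes)
    n_markers = len(genotypes[0])
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Sum of absolute allele-count differences across markers.
            diff = sum(abs(a - b) for a, b in zip(genotypes[i], genotypes[j]))
            matrix[i][j] = 1.0 - diff / (2.0 * n_markers)
    return matrix


# Hypothetical genotypes: three individuals, three markers.
example = [
    [0, 2, 1],
    [0, 2, 1],
    [2, 0, 1],
]
similarity = ibs_similarity(example)
for row in similarity:
    print(row)
```

Pairs of individuals with high similarity are expected to have correlated trait values for reasons unrelated to any single marker, which is what the kinship term in a mixed model accounts for.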
Accounting for Random Effects: Mixed Linear Models
- "Cost" associated with estimating a parameter
- We are not interested in the value of the parameter, only the variance
Q-K method (structured association):
y = Xβ + Sα + Qv + Zu + e
Fixed effects:
  β  Vector of fixed effects
  α  Vector of SNP effects
  v  Vector of subpopulation effects
Random effects:
  u  Vector of kinship effects
  e  Residuals
Matrices:
  Q        Matrix of population association (STRUCTURE)
  X, S, Z  Incidence matrices
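The random effects enter the model through the covariance of the trait values rather than through estimated coefficients. As a rough sketch, assuming Z is the identity matrix, Var(u) = K·σ²u for a kinship matrix K, and independent residuals with variance σ²e, the covariance of y is V = K·σ²u + I·σ²e. The scaling of K is an assumption of this example (some formulations use 2K), and the function name is hypothetical.

```python
def phenotypic_covariance(kinship, var_u, var_e):
    """Build V = K * var_u + I * var_e for a square kinship matrix K.

    Assumes Z is the identity (one record per individual) and that the
    residuals are independent with equal variance.
    """
    n = len(kinship)
    return [
        [kinship[i][j] * var_u + (var_e if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical kinship matrix for two related individuals.
kinship = [
    [1.0, 0.5],
    [0.5, 1.0],
]
v = phenotypic_covariance(kinship, var_u=2.0, var_e=1.0)
print(v)
```

Only the two variance components are estimated, no matter how many individuals there are, which is the reduced "cost" referred to above.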
[Diagram: inputs to the MLM: traits, markers, population structure (STRUCTURE), kinship (TASSEL)]
Obtain Markers
- Genome Resequencing Workflow
- Genotyping By Sequencing
MLM Pipeline for GWAS
Ed Buckler (Cornell University)
[Diagram: TASSEL pipeline: marker, trait, filter, convert, impute, K, GLM, MLM]
Zhang et al. Nature Genetics. 2010; doi:10.1038/ng.546
http://www.maizegenetics.net/statistical-genetics
http://www.maizegenetics.net/tassel/docs/Tassel_User_Guide_3.0.pdf
MLM Input Files
- Hapmap file
- Phenotype data (traits by strain)
- Kinship matrix*
- Population structure* (by strain; 3 populations, sum to 1)
* Kinship matrix & population structure data can be generated using TASSEL or with “MLM Workflow” App in DE
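Before running the workflow it can help to sanity-check the inputs. The sketch below pulls the strain names out of a Hapmap header line (Hapmap files have 11 fixed columns before the per-sample genotype columns) and checks that each strain's population memberships sum to 1. Function names, the tolerance, and the example strains are assumptions for illustration; this is not part of TASSEL or the DE.

```python
# Fixed Hapmap columns: rs#, alleles, chrom, pos, strand, assembly#,
# center, protLSID, assayLSID, panelLSID, QCcode.
HAPMAP_FIXED_COLUMNS = 11


def hapmap_sample_ids(header_line):
    """Return the sample (strain) names from a tab-delimited Hapmap header."""
    return header_line.rstrip("\n").split("\t")[HAPMAP_FIXED_COLUMNS:]


def memberships_sum_to_one(structure, tol=1e-6):
    """Check that every strain's population memberships sum to 1."""
    return all(abs(sum(values) - 1.0) <= tol for values in structure.values())


# Hypothetical header and population-structure table (3 populations).
header = "\t".join(
    ["rs#", "alleles", "chrom", "pos", "strand", "assembly#", "center",
     "protLSID", "assayLSID", "panelLSID", "QCcode", "strainA", "strainB"]
)
structure = {
    "strainA": [0.70, 0.20, 0.10],
    "strainB": [0.05, 0.05, 0.90],
}

samples = hapmap_sample_ids(header)
print(samples)
print(memberships_sum_to_one(structure))
print(set(samples) == set(structure))  # same strains in both files?
```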
Origin
Hapmap file:
- Download (e.g. http://triticeaetoolbox.org/)
- Convert from PLINK (.map/.ped) using Tassel 3 Conversion
- Impute with NPUTE
- Transform to numerical format with NumericalTransform
Phenotype data
Kinship matrix:
- Generate from hapmap marker data with Kinship
Population structure:
- Generate using ParallelStructure
- Convert to matrix with Structure2Tassel
MLM Output
MLM1.txt
- Marker
- “df”: degrees of freedom
- “F”: F statistic for the test of the marker
- “p”: p-value
- “errordf”: df used for the denominator of the F-test
- etc.
MLM2.txt
- Estimated effect for each allele for each marker
MLM3.txt
- The compression results show the likelihood, genetic variance, and error variance for each compression level tested during the optimization process.
See TASSEL manual for details: http://www.maizegenetics.net/tassel/docs/Tassel_User_Guide_3.0.pdf
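A typical next step is to pull out the markers whose p-values pass a multiple-testing threshold. The sketch below reads a tab-delimited table with "Marker" and "p" columns, as described for MLM1.txt, and applies a Bonferroni correction. The Bonferroni threshold is a choice made for this example rather than something the workflow applies, and the example table is invented; a real file contains more columns.

```python
import csv
import io


def significant_markers(stats_text, alpha=0.05):
    """Return markers whose p-value passes a Bonferroni-corrected threshold.

    stats_text: tab-delimited text with at least 'Marker' and 'p' columns.
    Rows without a usable p-value (empty or NaN) are skipped.
    """
    rows = csv.DictReader(io.StringIO(stats_text), delimiter="\t")
    tested = [row for row in rows if row["p"] not in ("", "NaN")]
    if not tested:
        return []
    threshold = alpha / len(tested)  # Bonferroni: alpha / number of tests
    return [row["Marker"] for row in tested if float(row["p"]) <= threshold]


# Hypothetical excerpt in the layout described above.
example = (
    "Marker\tdf\tF\tp\terrordf\n"
    "snp1\t2\t12.4\t0.00001\t270\n"
    "snp2\t2\t3.9\t0.02\t270\n"
    "snp3\t2\t0.9\t0.4\t270\n"
    "snp4\t2\t4.5\t0.012\t270\n"
)
hits = significant_markers(example)
print(hits)
```

With four tested markers the threshold is 0.05 / 4 = 0.0125, so snp2 (p = 0.02) is not reported even though it would pass an uncorrected 0.05 cutoff.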
THANKS!