1
Genome-Wide Association Studies Using SNPs
2
The three GBS analysis steps
1. Data processing
   A) Raw reads
   B) Read cleaning and quality control
2. Read mapping
   A) Aligning reads to the reference genome
   B) Selecting the best hit from multi-mapped reads
   C) BAM conversion
3. Variant discovery
   A) SNP/indel mining
   B) SNP/indel filtering
   C) Practical applications
      i) Diversity analysis
      ii) GWAS
3
Experimental design
Sample size: Aim for the largest sample size that your budget allows and that you can phenotype.
Rare alleles: If the trait of interest is associated with a rare allele in the population, you will need a much larger sample to detect the effects of these SNPs.
Coverage: GWAS relies on marker density being high enough that at least one genotyped SNP is in full (or nearly full) linkage disequilibrium with the causative variant.
4
Statistical Methods for GWAS
GWAS analyses exploit linkage disequilibrium (LD).
LD is a population-level association between markers and a quantitative trait locus (QTL).
Associations arise because small chromosome segments in the current population descend from the same common ancestor.
Segments inherited from that ancestor without recombination carry identical marker alleles or haplotypes.
A number of methodologies exploit these associations.
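To make the LD idea concrete, here is a minimal sketch of the common r² measure between two biallelic loci. It assumes phased haplotypes coded as 0/1 pairs; the function name `ld_r2` and the toy data are illustrative only and are not taken from the slides.

```python
def ld_r2(haplotypes):
    """Return r^2 between two biallelic loci.

    haplotypes: list of (a, b) tuples, each allele coded 0 or 1
    (hypothetical input format chosen for this sketch).
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n           # frequency of allele 1 at locus A
    p_b = sum(b for _, b in haplotypes) / n           # frequency of allele 1 at locus B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                              # LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    return d * d / denom


# Two loci that always travel together are in complete LD.
complete_ld = ld_r2([(1, 1), (1, 1), (0, 0), (0, 0)])
# Alleles that combine at random show no LD.
no_ld = ld_r2([(1, 1), (1, 0), (0, 1), (0, 0)])
```

A marker with r² close to 1 with the causal site carries nearly the same association signal as the causal site itself, which is why GWAS can work with markers alone.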
5
Binary trait: for example, disease state (present or absent)
Odds ratio
Pearson chi-square test
Fisher's exact test
Correlation/trend test
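As an illustration of two of the tests listed above, the sketch below computes the odds ratio and the Pearson chi-square statistic (1 degree of freedom, no continuity correction) from a 2x2 table of allele counts in cases and controls. The function name, argument names, and counts are assumptions made for this example.

```python
import math


def allelic_tests(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Odds ratio and Pearson chi-square test for a 2x2 allele-count table."""
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    row_totals = [sum(row) for row in table]
    col_totals = [case_alt + ctrl_alt, case_ref + ctrl_ref]

    odds_ratio = (case_alt * ctrl_ref) / (case_ref * ctrl_alt)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return odds_ratio, chi2, p_value


# Made-up counts: the alternate allele is more common in cases.
odds_ratio, chi2, p_value = allelic_tests(60, 40, 40, 60)
```

With these counts the odds ratio is 2.25 and the chi-square statistic is 8.0. Fisher's exact test is the usual choice when any expected cell count is small.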
6
Quantitative trait
Univariate models:
   t-test
   Wilcoxon rank-sum test
   ANOVA
   Kruskal-Wallis test
Multivariate models:
   Generalized linear model
   Mixed model
7
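Testing a marker's effect on a trait
Null hypothesis: the marker has no effect on the trait.
Alternative hypothesis: the marker does affect the trait.
Reject the null hypothesis when the F-statistic exceeds the critical value for the chosen significance level.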
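The F-statistic for a single marker can be sketched as a one-way ANOVA of phenotype on genotype class. This is an illustrative stdlib-only version; the function name and the toy data are assumptions, and the p-value lookup from the F distribution is left to a statistics package.

```python
def marker_f_statistic(phenotypes, genotypes):
    """One-way ANOVA F-statistic of phenotype grouped by genotype class.

    Returns (F, df_between, df_within).
    """
    groups = {}
    for y, g in zip(phenotypes, genotypes):
        groups.setdefault(g, []).append(y)

    n = len(phenotypes)
    k = len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two genotype classes and n > classes")

    grand_mean = sum(phenotypes) / n
    ss_between = 0.0
    ss_within = 0.0
    for values in groups.values():
        mean = sum(values) / len(values)
        ss_between += len(values) * (mean - grand_mean) ** 2
        ss_within += sum((y - mean) ** 2 for y in values)

    if ss_within == 0:
        raise ValueError("no within-class variation; F is undefined")

    df_between = k - 1
    df_within = n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy data: genotypes coded 0/1/2 (copies of the alternate allele).
genotypes = [0, 0, 0, 1, 1, 1, 2, 2, 2]
phenotypes = [1, 2, 3, 2, 3, 4, 3, 4, 5]
f_stat, df1, df2 = marker_f_statistic(phenotypes, genotypes)
```

For this toy data the statistic is F = 3.0 on (2, 6) degrees of freedom.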
8
Choice of significance level (alpha)
Which value of alpha should be used?
Bonferroni: The Bonferroni correction is an adjustment applied when several dependent or independent statistical tests are performed simultaneously on a single data set. To perform it, divide the critical P value (α) by the number of comparisons being made.
Permutation testing: Derive an empirical significance threshold that accounts for multiple testing by repeating the analysis on data with permuted phenotypes.
False discovery rate (FDR): The FDR is the expected proportion of detected QTL that are in fact false positives.
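The Bonferroni and FDR ideas above can be sketched in a few lines. The FDR part uses the Benjamini-Hochberg step-up procedure, which is one common way to control the FDR; the slides do not say which procedure was intended. Function names and p-values are illustrative.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


def benjamini_hochberg(p_values, fdr=0.05):
    """Indices of tests declared significant by the Benjamini-Hochberg procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its step-up threshold.
        if p_values[idx] <= rank * fdr / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


p_values = [0.01, 0.02, 0.03, 0.5]          # made-up p-values for four markers
threshold = bonferroni_threshold(0.05, len(p_values))
bonferroni_hits = [i for i, p in enumerate(p_values) if p <= threshold]
fdr_hits = benjamini_hochberg(p_values, fdr=0.05)
```

On this toy input Bonferroni keeps only the first marker, while the FDR procedure keeps the first three, which shows why FDR control is less conservative.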
9
Population structure and confounding
Care is needed to avoid spurious or inflated associations due to population structure. The main causes of confounding in GWAS are:
a) Population structure, i.e. the existence of major subpopulations within the population.
b) Cryptic relatedness, i.e. the existence of small groups (often pairs) of highly related individuals.
c) Environmental differences between subpopulations or geographical locations.
d) Differences in allele call rates between subpopulations.
10
Avoiding false positives due to population structure
Population structure that is left unaccounted for in the analysis will result in false-positive associations in GWAS. The remedy is to remove its effect by fitting a model that includes population structure.
11
Controlling population structure
Genomic control, where the chi-square test statistics are rescaled by an inflation factor estimated from the data.
Structured association, where a Bayesian population model assigns individuals to populations and population membership is fitted as a covariate.
Regression on a set of markers, where a set of fairly widely spaced markers covering the genome is used as covariates.
Principal components, where a number of components (e.g. 10) are used as covariates.
Mixed model method, where a random effect is fitted for each individual, with covariance based on an estimated kinship matrix or a known pedigree.
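Of the approaches listed above, genomic control is the simplest to sketch: estimate the inflation factor lambda as the median observed chi-square statistic divided by the median of a 1-df chi-square distribution (about 0.455), then divide every statistic by lambda. The function names and input values below are assumptions for illustration.

```python
import statistics

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(chi2_stats):
    """Genomic inflation factor lambda for 1-df chi-square statistics."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN


def genomic_control(chi2_stats):
    """Rescale statistics by lambda; lambda below 1 is treated as 1."""
    lam = max(genomic_inflation(chi2_stats), 1.0)
    return [x / lam for x in chi2_stats]


# Made-up statistics whose median is twice the expected median.
stats = [0.2, 0.9098728, 5.0]
lam = genomic_inflation(stats)        # close to 2.0
corrected = genomic_control(stats)    # every statistic roughly halved
```

A lambda near 1 suggests little inflation. Values well above 1 point to population structure or relatedness that the model has not absorbed.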
12
TASSEL: General Linear Model (GLM)
TASSEL uses a fixed-effects linear model to test for association between segregating sites and phenotypes. The analysis optionally accounts for population structure using covariates that indicate the degree of membership in underlying populations. A main-effects-only model is built automatically from all variables in the input data. A separate model is built and solved for each trait and marker combination.
13
TASSEL: Mixed Linear Model (MLM)
This option performs association analysis using a mixed linear model (MLM). A mixed model includes both fixed and random effects. Including random effects lets the MLM incorporate information about relationships among individuals. When a marker-based kinship matrix (K) is used jointly with population structure (Q), the “Q+K” approach improves statistical power.
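For reference, a mixed linear model of this kind is usually written in the generic form below. The notation is the standard textbook one and is an assumption here, not copied from the slides or from TASSEL itself; in particular, the scaling of K (for example a factor of 2) varies between implementations.

```latex
y = X\beta + Zu + e, \qquad
u \sim N\!\left(0,\ \sigma_g^{2} K\right), \qquad
e \sim N\!\left(0,\ \sigma_e^{2} I\right)
```

Here y is the vector of phenotypes, Xβ holds the fixed effects (the tested marker and the population-structure covariates Q), Z maps individuals to the random genetic effects u, and K is the kinship matrix.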
18
QQ plot explanation
The QQ plot compares the expected distribution of association test statistics across the SNPs (X-axis) with the observed values (Y-axis). A deviation from the X=Y line along most of its length implies a consistent difference between cases and controls across the whole genome, suggesting bias (false-positive associations). A clean QQ plot should show a solid line matching X=Y until it curves sharply at the end (representing the small number of true associations among thousands of unassociated SNPs).
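The coordinates of such a QQ plot can be computed without any plotting library. The sketch below pairs observed -log10(p) values with the values expected under the null hypothesis, using the plotting position (i - 0.5)/n, which is one common convention. The function name is hypothetical, and p-values are assumed to lie in (0, 1].

```python
import math


def qq_points(p_values):
    """Return (expected, observed) -log10(p) pairs, sorted ascending."""
    n = len(p_values)
    observed = sorted(-math.log10(p) for p in p_values)
    # Under the null hypothesis p-values are uniform, so the i-th smallest
    # p-value is expected near (i - 0.5) / n.
    expected = sorted(-math.log10((i - 0.5) / n) for i in range(1, n + 1))
    return list(zip(expected, observed))


# P-values that follow the null exactly fall on the X=Y line.
null_points = qq_points([0.125, 0.375, 0.625, 0.875])
```

Plotting these pairs as a scatter plot together with the X=Y line gives the figure described above.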
19
Manhattan plot