1
POLYMORPHISMS & ASSOCIATION TESTS
Arián Avalos Mayo-Illinois Computational Genomics Workshop June 22, 2017
2
OUTLINE Molecular Markers Genome Wide Association Studies (GWAS)
Beyond Single Locus Functional Effects
3
MOLECULAR MARKERS
What is a SNP and a SNV?
Single Nucleotide Polymorphism (SNP)
Single Nucleotide Variant (SNV)
I1: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT
I2: AACGAGCTAGCGATCGATCGACAACGACTACGAGGT
I3: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT
I4: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT
I5: AACGAGCTAGCGATCGATCGACAACGACTACGAGGT
I6: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT
I7: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT
I8: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT
Individuals I2 and I5 have a variation (T → A). This position is both a SNV and a SNP.
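A minimal sketch of how the variable position in the eight aligned sequences could be located programmatically. The function name `variable_sites` and the 0-based position convention are assumptions made for illustration; they are not part of the original slides.

```python
from collections import Counter

# The two sequence variants shown on the slide (I1..I8 are copies of one or the other).
REF = "AACGAGCTAGCGATCGATCGACTACGACTACGAGGT"
ALT = "AACGAGCTAGCGATCGATCGACAACGACTACGAGGT"
sequences = [REF, ALT, REF, REF, ALT, REF, REF, REF]  # I1..I8


def variable_sites(seqs):
    """Return (0-based position, major allele, minor allele, minor allele frequency)
    for every alignment column that carries more than one base.
    Only the two most common bases are reported (biallelic sites assumed)."""
    sites = []
    for pos, column in enumerate(zip(*seqs)):
        counts = Counter(column)
        if len(counts) > 1:
            (major, _), (minor, minor_n) = counts.most_common(2)
            sites.append((pos, major, minor, minor_n / len(seqs)))
    return sites


sites = variable_sites(sequences)
print(sites)  # [(22, 'T', 'A', 0.25)]
```

In this toy alignment the minor allele A is carried by 2 of 8 individuals, so the site would pass a 5% frequency filter.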
4
MOLECULAR MARKERS
A SNV is any single-nucleotide change (e.g. a somatic mutation, even an artifact).
A SNP has defining criteria:
  A polymorphic SNV, with “Major” and “minor” alleles
  Sometimes defined by frequency level (e.g. a minor allele frequency of at least 5%)
  A segregating site, i.e. a germplasm polymorphism in a population
For reference, the 1000 Genomes Project identified ~41 Million SNPs across ~1000 individuals.
5
MOLECULAR MARKERS
Both types of variants are relevant, depending on the field:
  Population geneticists conducting association tests will focus on SNPs
  Cancer geneticists will instead be interested in SNVs
The terminology is further complicated in non-human biology (e.g. polyploidy, horizontal gene transfer, etc.)
6
GWAS: Resources Zhang X. et al. (2012). PLoS Comput Biol 8(12): e Bush W. S. & Moore J. H. (2012) PLoS Comput Biol 8(12): e
7
GWAS Example: Cystic Fibrosis and the CFTR gene mutations
Approach: Genetic Linkage Analysis
  Genotype family members (some individuals carrying the disease)
  Find a marker that correlates with the disease
  The disease gene lies close to this marker
8
GWAS Concept: When a marker is correlated with a trait, it is likely to be genetically linked to the locus in a family analysis
9
GWAS
10
GWAS Limitations of Genetic Linkage Analysis
Requires data from entire families, preferably large ones, where the trait is segregating
Linkage analysis is less successful with common diseases, e.g. heart disease or cancers
Requires single, large-effect loci
11
GWAS
Hypothesis: common diseases are influenced by common genetic variation in the population.
Implications:
  Any individual variation (SNP) will have a relatively small correlation with the disease
  The combinatorial effect of many alleles is what influences the disease phenotype
This argues for population- rather than family-based studies.
12
GWAS Bush W. S. & Moore J. H. (2012) PLoS Comput Biol 8(12): e
13
GWAS: Genotyping Microarray – can assay 0.5 – 1.0 Million or more SNPs
Whole-genome sequencing (WGS) – assays (near) complete SNP profile In non-human genetics, reduced-representation methods provide a middle-ground.
14
GWAS: Phenotyping Case / Control – qualitative, usually binary measure (e.g. disease vs. no disease) Quantitative – continuous measure usually complex phenotypes (e.g. blood pressure, LDL levels) Possible to look at more than one phenotype?
15
GWAS Case / Control
Individual: sequence                      Disease?
I1: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I2: AACGAGCTAGCGATCGATCGACAACGACTACGAGGT  +
I3: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I4: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I5: AACGAGCTAGCGATCGATCGACAACGACTACGAGGT  +
I6: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I7: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I8: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
17
GWAS Before analysis and interpretations a few considerations:
Correlation is not causation Linkage disequilibrium (see later) Population structure (see later) Phenotyping
18
GWAS Further consider that even if the analysis is successful, findings can be hard to interpret Example: SNP correlates well with heart disease Biochemical link? Behavioral link (you particularly like bacon…)?
19
GWAS: Statistics
Case vs. Control

            A   T
Case        3   1
Control     1   9

I1:  AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I2:  AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I3:  AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I4:  AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I5:  AACGAGCTAGCGATCGATCGACAACGACTACGAGGT  +
I6:  AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I7:  AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I8:  AACGAGCTAGCGATCGATCGACAACGACTACGAGGT  +
I9:  AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I10: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  +
I11: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I12: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I13: AACGAGCTAGCGATCGATCGACAACGACTACGAGGT  -
I14: AACGAGCTAGCGATCGATCGACAACGACTACGAGGT  +
20
GWAS: Statistics
Case vs. Control: Fisher’s Exact Test
Of all 14 individuals, 4 are cases and 4 carry A; 3 individuals are both.

            A   T
Case        3   1
Control     1   9

p-value < 0.05
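The p-value on this slide can be checked by hand. The sketch below implements a two-sided Fisher's exact test with only the standard library, using the counts read off the 14 sequences (cases: 3 A, 1 T; controls: 1 A, 9 T). The function name and the two-sided convention (summing all tables no more probable than the observed one) are choices made here for illustration; the slide does not say which variant was used.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum over all tables that are at most as probable as the observed one
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


# Rows: case, control. Columns: allele A, allele T.
p_value = fisher_exact_two_sided(3, 1, 1, 9)
print(round(p_value, 4))  # 0.041
```

The exact value is 41/1001, which is consistent with the "p-value < 0.05" stated on the slide.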
21
GWAS: Statistics
A favored alternative to Fisher’s Exact Test is the Chi-Squared Test.
We conduct this test on EACH SNP separately, and get a corresponding p-value.
The smallest p-values point to the SNPs most strongly associated with the disease.
22
GWAS: Statistics
Both Fisher’s Exact Test and the Chi-Squared Test are allelic association tests here, i.e. we test whether having A instead of T at the polymorphic site correlates with the disease.
In a genotypic association test, each position is a combination of two alleles, e.g. AA, TT, AT.
We therefore correlate the genotype with the phenotype of the individual.
23
GWAS: Statistics
There are various options for a Case vs. Control genotypic association test.
Example: Dominant Model

            AA or AT   TT
Case        ?          ?
Control     ?          ?
24
GWAS: Statistics
There are various options for a Case vs. Control genotypic association test.
Example: Dominant Model, Recessive Model

            AA   AT or TT
Case        ?    ?
Control     ?    ?
25
GWAS: Statistics
There are various options for a Case vs. Control genotypic association test.
Example: Dominant Model, Recessive Model, 2x3 Table

            AA    AT    TT
Case        O11   O12   O13
Control     O21   O22   O23

Chi-Squared Test:
$\chi^2 = \sum_i \sum_j \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$
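As a rough illustration of the 2x3 genotypic test, the sketch below computes the chi-squared statistic from observed counts, deriving each expected count from the row and column totals. The genotype counts are invented for the example. The p-value line relies on the fact that a 2x3 table has 2 degrees of freedom, for which the chi-squared tail probability is exactly exp(-x/2).

```python
from math import exp


def chi_squared(observed):
    """Pearson chi-squared statistic for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o_ij in enumerate(row):
            # Expected count under independence of genotype and disease status
            e_ij = row_totals[i] * col_totals[j] / n
            stat += (o_ij - e_ij) ** 2 / e_ij
    return stat


# Hypothetical counts. Rows: case, control. Columns: AA, AT, TT.
table = [[20, 30, 50],
         [30, 30, 40]]
chi2 = chi_squared(table)
p_value = exp(-chi2 / 2)  # valid only for 2 degrees of freedom (a 2x3 table)
print(round(chi2, 3), round(p_value, 3))
```

The dominant and recessive models correspond to merging two of the three genotype columns before running the same calculation on the resulting 2x2 table (which then has 1 degree of freedom, so the p-value shortcut above no longer applies).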
26
GWAS: Statistics
Quantitative Phenotypes
$Y = a + bX$
If there is no association, $b \approx 0$.
The more $b$ differs from 0, the stronger the association.
This is called linear regression.
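A small sketch of the regression idea, fitting Y = a + bX by ordinary least squares. It assumes that X is the genotype coded as the number of copies of one allele (0, 1 or 2) and Y is the measured phenotype; that additive coding is a common convention but is not stated on the slide, and the data are invented.

```python
def fit_line(x, y):
    """Ordinary least squares estimates (a, b) for y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx            # slope: change in phenotype per allele copy
    a = mean_y - b * mean_x  # intercept: expected phenotype at genotype 0
    return a, b


# Hypothetical data: allele counts and a phenotype that rises with each copy.
genotypes = [0, 1, 2, 0, 1, 2]
phenotype = [1.0, 3.0, 5.0, 1.0, 3.0, 5.0]
a, b = fit_line(genotypes, phenotype)
print(a, b)
```

A full analysis would also need a standard error for b to turn the slope into a p-value; that step is omitted here.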
27
GWAS: Statistics Quantitative Phenotypes
Another statistical test commonly used on GWAS matrices is Analysis of Variance (ANOVA) Statistical models for GWAS can get quite involved (can give references on request)
28
GWAS: Statistics Lambert et al., 2013: Nature Genetics 45, 1452
29
GWAS: Statistics Multiple Hypothesis Correction
What does p-value = 0.01 mean?
It means that, if there were no true association, a Genotype x Phenotype correlation at least this strong would have only a 1% probability of happening just by chance.
What if we repeat the test for 1 Million SNPs?
About 1% of those tests (10,000 SNPs) will show this level of correlation just by chance (and by definition).
31
GWAS: Statistics Multiple Hypothesis Correction Bonferroni
Multiply the p-value by the number of tests.
So if the original SNP had p-value $p$, the new p-value is defined as $p' = p \times N$.
With $N = 10^6$, a p-value of $10^{-9}$ is downgraded to $p' = 10^{-9} \times 10^6 = 10^{-3}$.
This is still quite good!
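The Bonferroni adjustment is a one-liner. This sketch assumes N = 10^6 tests, the value implied by the slide's arithmetic (10^-9 becoming 10^-3), and caps adjusted values at 1 so they remain valid probabilities; the cap is standard practice but is not mentioned on the slide.

```python
def bonferroni(p_values, n_tests):
    """Multiply each p-value by the number of tests, capping the result at 1."""
    return [min(1.0, p * n_tests) for p in p_values]


adjusted = bonferroni([1e-9, 1e-4], n_tests=10**6)
print(adjusted)
```

The first SNP stays significant at about 0.001, while the second, with a raw p-value of 10^-4, is pushed all the way to 1.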
32
GWAS: Statistics Multiple Hypothesis Correction False Discovery Rate
Given a threshold $\alpha$ (e.g. 0.05)
Sort all p-values ($N$ of them) in ascending order (i.e. $p_1 \le p_2 \le \dots \le p_N$)
Compute for each rank $i$: $p'_i = p_i \times \frac{N}{i}$
Require $p'_i < \alpha$
This ensures the expected proportion of false positives among the reported associations is $< \alpha$
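The rank-based adjustment above matches the Benjamini-Hochberg procedure. The sketch below follows the usual formulation, which adds one step not shown on the slide: adjusted values are made monotone from the largest rank downward, so a smaller raw p-value never ends up with a larger adjusted value. The example p-values are invented.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda k: p_values[k])  # indices by ascending p
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value (rank n) down to the smallest (rank 1)
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


p_values = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(p_values)
significant = [q < 0.05 for q in adjusted]
print(adjusted, significant)
```

Under Bonferroni only the two smallest of these p-values would survive at 0.05 (0.005 x 4 = 0.02 and 0.01 x 4 = 0.04); under this procedure all four do, which illustrates why FDR control is less conservative.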
33
BEYOND SINGLE LOCUS
So far we have tested each SNP separately; however, recall our hypothesis that common diseases are influenced by common variants.
Maybe considering two SNPs together will identify a stronger correlation with the phenotype.
Main problem: the number of pairs grows as ~ $N^2$
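To make the scale of the pairwise problem concrete, the arithmetic can be sketched as follows, assuming a panel of N = 1 million SNPs (the figure used in the multiple-testing example) and a Bonferroni-style threshold at 0.05. Both choices are assumptions for illustration.

```python
from math import comb

n_snps = 10**6
n_pairs = comb(n_snps, 2)             # N * (N - 1) / 2 unordered pairs
pairwise_threshold = 0.05 / n_pairs   # per-test threshold under Bonferroni

print(n_pairs)             # 499999500000
print(pairwise_threshold)  # roughly 1e-13
```

Roughly 5 x 10^11 tests means both a heavy computational load and a per-pair significance threshold near 10^-13.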
34
BEYOND SINGLE LOCUS
Further consider that in genotyping we may be using a Microarray (e.g. 0.5 – 1 Million SNPs).
But there are many more sites in the human genome where variation may exist. Will we then miss any causal variant outside the panel of ~1 Million?
Not necessarily.
35
BEYOND SINGLE LOCUS
Two sites close to each other may vary in a highly correlated manner; this is Linkage Disequilibrium (LD).
In this situation, a lack of recombination events has made the inheritance of those two sites dependent.
If two such sites have high LD, then one site can serve as a proxy for the other.
36
BEYOND SINGLE LOCUS So if sites X & Y have high LD, and X is in the Microarray, then knowing the allelic form of X informs the allelic form at Y In this way a reduced panel can represent a larger number (all?) of the common SNPs
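The slides do not say how LD is measured. One widely used statistic is r², the squared correlation between alleles at two sites. The sketch below computes it from phased haplotypes coded 0/1 at each site; the coding, the function name and the toy data are all assumptions for illustration.

```python
def ld_r_squared(site_x, site_y):
    """r^2 between two biallelic sites given per-haplotype alleles coded 0/1.
    Both sites must be polymorphic, otherwise the denominator is zero."""
    n = len(site_x)
    p_x = sum(site_x) / n   # frequency of allele 1 at site X
    p_y = sum(site_y) / n   # frequency of allele 1 at site Y
    p_xy = sum(1 for x, y in zip(site_x, site_y) if x == 1 and y == 1) / n
    d = p_xy - p_x * p_y    # deviation from independent inheritance
    return d * d / (p_x * (1 - p_x) * p_y * (1 - p_y))


# Site Y always carries the same allele as site X: X is a perfect proxy for Y.
linked = ld_r_squared([0, 0, 1, 1, 0, 1, 0, 0], [0, 0, 1, 1, 0, 1, 0, 0])
# Alleles at the two sites are combined independently: X says nothing about Y.
unlinked = ld_r_squared([0, 0, 1, 1], [0, 1, 0, 1])
print(linked, unlinked)
```

An r² near 1 is the situation in which a genotyped site X can stand in for an untyped site Y.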
37
BEYOND SINGLE LOCUS Impact of LD
38
BEYOND SINGLE LOCUS A problem is that if X correlates with a disease, the causal variant may be either X or Y
39
BEYOND SINGLE LOCUS Impact of Population Structure.
40
BEYOND SINGLE LOCUS
In many cases, GWAS are able to find SNPs that have a significant association with disease (see the GWAS Catalog).
Yet the final predictive power (the ability to predict disease from genotype) is limited for complex diseases.
See: Finding the Missing Heritability of Complex Diseases
41
BEYOND SINGLE LOCUS
Increasingly, whole-exome and even whole-genome sequencing are used for variant detection
Taking on the non-coding variants: use functional genomics data as a template
Network-based analysis rather than single-site or site-pair analysis
Complement GWAS with family-based studies
42
FUNCTIONAL EFFECTS How do we predict how a variant is likely to be affecting protein function?
43
FUNCTIONAL EFFECTS Case:
I found a SNP inside the coding sequence. Knowing how to translate the gene sequence to a protein sequence, I discovered that this is a non-synonymous change, i.e., the encoded amino acid changes. This is an nsSNP. Will that impact the protein’s function? (And I don’t quite know how the protein functions in the first place ...)
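The translation step in this case study can be sketched with the standard genetic code. The helper below mutates one base of a codon and reports whether the encoded amino acid changes. The function name and the example codons are illustrative assumptions; a real pipeline would also need the gene's reading frame and strand.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(codon, offset, new_base):
    """Return (reference amino acid, variant amino acid, label) for a
    single-base change at position `offset` (0, 1 or 2) of `codon`."""
    mutated = codon[:offset] + new_base + codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[codon], CODON_TABLE[mutated]
    label = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return ref_aa, alt_aa, label


print(classify_coding_snp("GAG", 1, "T"))  # ('E', 'V', 'non-synonymous')
print(classify_coding_snp("GAA", 2, "G"))  # ('E', 'E', 'synonymous')
```

Knowing that a change is non-synonymous is only the starting point; whether it damages the protein is the question the prediction tools on the following slides try to answer.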
44
FUNCTIONAL EFFECTS
Two popular approaches:
  PolyPhen 2.0: Adzhubei, I. A. et al. (2010). Nat Methods 7(4):
  SIFT: Kumar, P. et al. (2009). Nat Protoc 4(7):
45
FUNCTIONAL EFFECTS PolyPhen 2.0
46
FUNCTIONAL EFFECTS
The PolyPhen 2.0 pipeline uses existing data sets for training and later evaluation of target data.
Specifically, it uses the HumDiv database, which is:
  A compilation of all the damaging mutations with known effects on molecular function
  A collection of non-damaging differences between human proteins and those of closely related mammalian homologs
47
FUNCTIONAL EFFECTS PolyPhen 2.0 features:
48
FUNCTIONAL EFFECTS A look at the Multiple Sequence Alignment (MSA) part of the PolyPhen 2.0 pipeline:
49
FUNCTIONAL EFFECTS
Of interest is the Position-Specific Independent Counts (PSIC) score.
This score reflects the amino acid’s frequency at the specific position in the sequence, given an MSA.
50
FUNCTIONAL EFFECTS Example:
51
FUNCTIONAL EFFECTS
To derive the PSIC score we first calculate the frequency of each amino acid:
$p(a, i) = \frac{n(a, i)^{\mathrm{eff}}}{\sum_b n(b, i)^{\mathrm{eff}}}$
52
FUNCTIONAL EFFECTS
$p(a, i) = \frac{n(a, i)^{\mathrm{eff}}}{\sum_b n(b, i)^{\mathrm{eff}}}$
The idea: $n(a, i)^{\mathrm{eff}}$ is not the raw count of amino acid “$a$” at position $i$; rather, it is adjusted for the many closely related sequences in the MSA.
The PSIC score of a SNP $a \to b$ at position $i$ is given by:
$\mathrm{PSIC}(a \to b, i) \propto \ln \frac{p(b, i)}{p(a, i)}$
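A simplified sketch of the log-ratio above. It is not the PolyPhen 2.0 implementation: raw counts plus a pseudocount stand in for the effective counts (the real method down-weights closely related sequences), and the pseudocount is an assumption added here so that an amino acid absent from the column does not produce ln(0). The alignment column is invented.

```python
from collections import Counter
from math import log

AMINO_ACID_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def psic_like_score(column, ref_aa, alt_aa, pseudocount=1.0):
    """ln(p(alt, i) / p(ref, i)) for one MSA column, using raw counts plus a
    pseudocount as a stand-in for the effective counts of the real method."""
    counts = Counter(column)
    total = sum(counts[aa] + pseudocount for aa in AMINO_ACID_ALPHABET)
    p_ref = (counts[ref_aa] + pseudocount) / total
    p_alt = (counts[alt_aa] + pseudocount) / total
    return log(p_alt / p_ref)


# Hypothetical MSA column at the variant position: mostly leucine.
column = "LLLLLLLIVL"
print(psic_like_score(column, "L", "P"))  # proline never observed here
print(psic_like_score(column, "L", "I"))  # isoleucine observed once
```

With this sign convention, a substitution to a residue rarely or never seen at that position yields a strongly negative score, which is the kind of signal a downstream classifier can use.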
53
FUNCTIONAL EFFECTS Ultimately your derived score can be compared with the existing scores from HumDiv
54
FUNCTIONAL EFFECTS Classification Naive Bayes method
A type of classifier. Other classification algorithms include “Support Vector Machine”, “Decision Tree”, “Neural Net”, “Random Forest” etc. Sometimes called “Machine Learning” What is a classification algorithm? What is a Naive Bayes method/classifier?
55
FUNCTIONAL EFFECTS
“Supervised Learning”
Training Data:
  Positive examples:
    $x_{11}, x_{12}, \dots, x_{1n}$  +
    $x_{21}, x_{22}, \dots, x_{2n}$  +
    …
  Negative examples:
    $x_{i+1,1}, x_{i+1,2}, \dots, x_{i+1,n}$  -
    $x_{i+2,1}, x_{i+2,2}, \dots, x_{i+2,n}$  -
Training Data → MODEL
56
FUNCTIONAL EFFECTS
Data Vector $x_1, x_2, \dots, x_n$ → MODEL → Yes or No
57
FUNCTIONAL EFFECTS
Naïve Bayes Classifier
Training Data → Bayesian Inference → + or −
Bayesian Inference: expresses how a subjective assessment of likelihood should rationally change to account for evidence.
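A minimal sketch of a Naive Bayes classifier, to make the "training data in, + or − out" picture concrete. It assumes numeric features modelled with one Gaussian per feature and class; that modelling choice, the function names and the toy data are illustrative and are not claimed to match what PolyPhen 2.0 does internally.

```python
from math import log, pi
from statistics import mean, pvariance


def train_gaussian_nb(rows, labels):
    """Estimate, per class, a prior and a (mean, variance) pair for each feature."""
    model = {}
    for label in set(labels):
        members = [row for row, lab in zip(rows, labels) if lab == label]
        prior = len(members) / len(rows)
        # Small constant keeps the variance strictly positive
        stats = [(mean(f), pvariance(f) + 1e-9) for f in zip(*members)]
        model[label] = (prior, stats)
    return model


def predict(model, row):
    """Return the class with the highest log posterior, treating features as
    independent given the class (the 'naive' assumption)."""
    def log_posterior(label):
        prior, stats = model[label]
        total = log(prior)
        for x, (mu, var) in zip(row, stats):
            total += -0.5 * log(2 * pi * var) - (x - mu) ** 2 / (2 * var)
        return total
    return max(model, key=log_posterior)


# Hypothetical training data: two features per variant, labelled + or -.
rows = [[2.0, 1.8], [2.2, 2.1], [1.9, 2.0], [0.1, 0.2], [0.0, -0.1], [0.2, 0.1]]
labels = ["+", "+", "+", "-", "-", "-"]
model = train_gaussian_nb(rows, labels)
print(predict(model, [2.1, 1.9]), predict(model, [0.1, 0.0]))
```

The prior plays the role of the initial "subjective assessment", and the per-feature likelihoods are the evidence that updates it.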
58
FUNCTIONAL EFFECTS In statistics, frequentists and Bayesians often disagree. A frequentist is a person whose long-run ambition is to be wrong 5% of the time. A Bayesian is one who, vaguely expecting a horse, and catching a glimpse of a donkey, strongly believes he has seen a mule.
59
Or…
60
FUNCTIONAL EFFECTS Evaluating a classifier: Cross-validation
PREDICT AND EVALUATE ON THESE TRAIN ON THESE FOLD 1
61
FUNCTIONAL EFFECTS Evaluating a classifier: Cross-validation
PREDICT AND EVALUATE ON THESE FOLD 2
62
FUNCTIONAL EFFECTS Evaluating a classifier: Cross-validation
PREDICT AND EVALUATE ON THESE FOLD k
63
FUNCTIONAL EFFECTS Evaluating a classifier: Cross-validation Collect all evaluation results (from k “FOLD”s)
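The fold-by-fold procedure can be sketched as an index generator: each fold is held out once for prediction and evaluation while the remaining samples are used for training. The function name is an assumption, and no shuffling is done here; in practice the sample order is usually randomized first.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if f < n_samples % k else 0) for f in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


splits = list(k_fold_splits(10, 3))
for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"FOLD {fold}: train on {train_idx}, predict and evaluate on {test_idx}")
```

Because every sample lands in exactly one test fold, pooling the k sets of predictions gives one evaluation over the whole data set without ever scoring a sample the model was trained on.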
64
FUNCTIONAL EFFECTS Evaluating Classification Performance Wikipedia
65
FUNCTIONAL EFFECTS The Receiver Operating Characteristic (ROC) curve
True positive rate vs. false positive rate
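A small sketch of how the points of a ROC curve, and the area under it, can be computed from classifier scores. It lowers the decision threshold one sample at a time and assumes all scores are distinct (ties would need to be grouped); the scores and labels are invented.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points obtained by
    lowering the threshold past one sample at a time. Labels are 1 or 0."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
auc = area_under_curve(points)
print(points, auc)
```

An area of 1 would mean every positive outranks every negative, while 0.5 corresponds to random ordering; here one negative outranks one positive, giving 8/9.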
66
FUNCTIONAL EFFECTS What about those SNPs outside the coding regions?
Generally hard enough to predict within coding regions – regulatory sequences notoriously hard to pin down (see ENCODE controversy) One interesting new approach uses Support Vector Machine (SVM) classifiers to describe damage to cell-specific regulatory motif vocabularies.
67
FUNCTIONAL EFFECTS