POLYMORPHISMS & ASSOCIATION TESTS


POLYMORPHISMS & ASSOCIATION TESTS Arián Avalos Mayo-Illinois Computational Genomics Workshop June 22, 2017

OUTLINE: Molecular Markers; Genome Wide Association Studies (GWAS); Beyond Single Locus; Functional Effects

MOLECULAR MARKERS What is a SNP and a SNV? SNP: Single Nucleotide Polymorphism. SNV: Single Nucleotide Variant.
I1: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT
I2: AACGAGCTAGCGATCGATCGACAACGACTACGAGGT
I3: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT
I4: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT
I5: AACGAGCTAGCGATCGATCGACAACGACTACGAGGT
I6: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT
I7: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT
I8: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT
Individuals I2 and I5 have a variation (T → A). This position is both a SNV and a SNP.

MOLECULAR MARKERS A SNV is any change (e.g. a somatic mutation, or even an artifact). A SNP has defining criteria: it is a polymorphic SNV with “major” and “minor” alleles; it is sometimes defined by a frequency level (e.g. a minor allele frequency of at least 5%); it is a segregating site, i.e. a germplasm polymorphism in a population. For reference, the 1000 Genomes project identified ~41 Million SNPs across ~1000 Individuals.

MOLECULAR MARKERS Both types of variants are relevant, depending on the field. Population geneticists conducting association tests will focus on SNPs. Cancer geneticists will instead be interested in SNVs. The terminology is further complicated in non-human biology (e.g. polyploidy, horizontal gene transfer, etc.)

GWAS: Resources Zhang X. et al. (2012). PLoS Comput Biol 8(12): e1002828.  Bush W. S. & Moore J. H. (2012) PLoS Comput Biol 8(12): e1002822.

GWAS Example: Cystic Fibrosis and the CFTR gene mutations. Approach: Genetic Linkage Analysis. Genotype family members (some individuals carrying the disease). Find a marker that correlates with the disease. The disease gene lies close to this marker.

GWAS Concept: When a marker is correlated with a trait, it is likely to be genetically linked to the locus in a family analysis

GWAS

GWAS Limitations of Genetic Linkage Analysis: Requires data from entire families, preferably large ones, where the trait is segregating. Linkage analysis is less successful with common diseases, e.g., heart disease or cancers. Requires single, large-effect loci.

GWAS Hypothesis: common diseases are influenced by common genetic variation in the population. Implications: Any individual variant (SNP) will have a relatively small correlation with the disease. The combined effect of many alleles is what influences the disease phenotype. This argues for population-based rather than family-based studies.

GWAS Bush W. S. & Moore J. H. (2012) PLoS Comput Biol 8(12): e1002822

GWAS: Genotyping Microarray – can assay 0.5 – 1.0 Million or more SNPs Whole-genome sequencing (WGS) – assays (near) complete SNP profile In non-human genetics, reduced-representation methods provide a middle-ground.

GWAS: Phenotyping Case / Control – qualitative, usually binary measure (e.g. disease vs. no disease). Quantitative – continuous measure, usually of complex phenotypes (e.g. blood pressure, LDL levels). Possible to look at more than one phenotype?

GWAS Case / Control (last column: Disease?)
I1: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT -
I2: AACGAGCTAGCGATCGATCGACAACGACTACGAGGT +
I3: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT -
I4: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT -
I5: AACGAGCTAGCGATCGATCGACAACGACTACGAGGT +
I6: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT -
I7: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT -
I8: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT -

GWAS Before analysis and interpretation, a few considerations: Correlation is not causation. Linkage disequilibrium (see later). Population structure (see later). Phenotyping.

GWAS Further consider that even if the analysis is successful, findings can be hard to interpret Example: SNP correlates well with heart disease Biochemical link? Behavioral link (you particularly like bacon…)?

GWAS: Statistics Case vs. Control
          A   T
Case      3   1
Control   1   9
I1: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT -
I2: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT -
I3: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT -
I4: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT -
I5: AACGAGCTAGCGATCGATCGACAACGACTACGAGGT +
I6: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT -
I7: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT -
I8: AACGAGCTAGCGATCGATCGACAACGACTACGAGGT +
I9: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT -
I10: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT +
I11: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT -
I12: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT -
I13: AACGAGCTAGCGATCGATCGACAACGACTACGAGGT -
I14: AACGAGCTAGCGATCGATCGACAACGACTACGAGGT +

GWAS: Statistics Case vs. Control: The Fisher’s Exact Test. Of all 14 individuals, 4 are cases and 4 carry A; 3 individuals are both.
          A   T
Case      3   1
Control   1   9
p-value < 0.05
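The test on this slide can be reproduced by hand. The following is a minimal sketch, assuming the counts tallied from the 14 sequences listed on the previous slide (cases: 3 A, 1 T; controls: 1 A, 9 T). The function name is hypothetical, and the two-sided rule used here (sum every table no more probable than the observed one) is one common convention, not necessarily the one used for the slide.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

# Rows: Case, Control. Columns: A, T.
p_value = fisher_exact_two_sided(3, 1, 1, 9)
print(round(p_value, 4))  # 0.041
```

The result, 41/1001, is just below 0.05, in line with the p-value reported on the slide.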

GWAS: Statistics A favored alternative to the Fisher’s Exact Test is the Chi-Squared Test. We conduct this test on EACH SNP separately, and get a corresponding p-value. The smallest p-values point to the SNPs most strongly associated with the disease.

GWAS: Statistics Both the Fisher’s Exact Test and the Chi-Squared Test, as used so far, are allelic association tests, i.e. we test whether A instead of T at the polymorphic site correlates with the disease. In a genotypic association test, each position is a combination of two alleles, e.g. AA, TT, AT. We therefore correlate the genotype with the phenotype of the individual.

GWAS: Statistics There are various options for a Case vs. Control genotypic association test. Example: Dominant Model
          AA or AT   TT
Case      ?          ?
Control   ?          ?

GWAS: Statistics There are various options for a Case vs. Control genotypic association test. Examples: Dominant Model, Recessive Model. Recessive Model:
          AA   AT or TT
Case      ?    ?
Control   ?    ?

GWAS: Statistics There are various options for a Case vs. Control genotypic association test. Examples: Dominant Model, Recessive Model, 2x3 Table. 2x3 Table:
          AA    AT    TT
Case      O11   O12   O13
Control   O21   O22   O23
Chi-Squared Test: $\chi^2 = \sum_i \sum_j \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$
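The chi-squared statistic for the 2x3 genotype table can be computed in a few lines. This is a minimal sketch: the genotype counts are invented purely for illustration, the function name is hypothetical, and the expected counts E_ij are taken as (row total × column total) / grand total, the usual choice for a test of independence. The closed-form p-value used here holds only for 2 degrees of freedom.

```python
from math import exp

def chi_squared(observed):
    """Return (statistic, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o_ij in enumerate(row):
            e_ij = row_totals[i] * col_totals[j] / total  # expected under independence
            stat += (o_ij - e_ij) ** 2 / e_ij
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, dof

# Hypothetical counts. Rows: Case, Control. Columns: AA, AT, TT.
table = [[30, 50, 20],
         [10, 50, 40]]
stat, dof = chi_squared(table)
# For exactly 2 degrees of freedom the chi-squared tail probability is exp(-x / 2).
p_value = exp(-stat / 2)
print(round(stat, 3), dof, p_value)
```

The dominant and recessive models correspond to merging two of the three genotype columns before running the same calculation on the resulting 2x2 table.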

GWAS: Statistics Quantitative Phenotypes: $Y = a + bX$. If there is no association, $b \approx 0$. The more $b$ differs from 0, the stronger the association. This is called linear regression.
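A least-squares fit of Y = a + bX takes only a few lines. In this sketch, X is coded as the number of copies of the A allele (0, 1 or 2); that additive coding, the function name and the phenotype values are all assumptions made for illustration, not taken from the slides.

```python
def fit_line(x, y):
    """Ordinary least-squares estimates (a, b) for y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = s_xy / s_xx          # slope: change in phenotype per extra copy of A
    a = mean_y - b * mean_x  # intercept
    return a, b

# Hypothetical data: genotype as A-allele count, phenotype as a continuous measure.
genotype = [0, 0, 1, 1, 2, 2]
phenotype = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]
a, b = fit_line(genotype, phenotype)
print(a, b)
```

With these made-up values the slope comes out close to 1, i.e. each extra copy of A raises the phenotype by about one unit. A real analysis would also test whether b differs significantly from 0.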

GWAS: Statistics Quantitative Phenotypes Another statistical test commonly used on GWAS matrices is Analysis of Variance (ANOVA) Statistical models for GWAS can get quite involved (can give references on request)

GWAS: Statistics Lambert et al., 2013: Nature Genetics 45, 1452

GWAS: Statistics Multiple Hypothesis Correction. What does p-value = 0.01 mean? It means that, if there were no true association, a Genotype x Phenotype correlation at least this strong would have only a 1% probability of arising just by chance. What if we repeat the test for 1 Million SNPs? Of those tests, about 1% (10,000 SNPs) are expected to show this level of correlation just by chance (and by definition), even if no SNP is truly associated.

GWAS: Statistics Multiple Hypothesis Correction: Bonferroni. Multiply the p-value by the number of tests. So if the original SNP had p-value $p$, the new p-value is defined as $p' = p \times N$. With $N = 10^6$, a p-value of $10^{-9}$ is downgraded to $p' = 10^{-9} \times 10^{6} = 10^{-3}$. This is quite good!
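The Bonferroni arithmetic on this slide, as a short sketch. The function name is hypothetical, and capping the result at 1 is an addition not stated on the slide (a corrected p-value above 1 is not meaningful as a probability).

```python
def bonferroni_adjust(p, n_tests):
    """Bonferroni-corrected p-value: p' = p * N, capped at 1."""
    return min(1.0, p * n_tests)

n_tests = 10**6
corrected = bonferroni_adjust(1e-9, n_tests)
print(corrected)  # approximately 1e-3, as on the slide

# A nominal p-value of 0.01 does not survive one million tests.
print(bonferroni_adjust(0.01, n_tests))
```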

GWAS: Statistics Multiple Hypothesis Correction: False Discovery Rate. Given a threshold $\alpha$ (e.g. 0.05): Sort all p-values ($N$ of them) in ascending order (i.e. $p_1 \le p_2 \le \dots \le p_N$). Compute for each $p_i$: $p_i' = p_i \times \frac{N}{i}$. Require $p_i' < \alpha$. This ensures the expected proportion of false positives among the reported associations is $< \alpha$.
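The ranking rule p_i' = p_i × N / i can be sketched as follows. The function name and the example p-values are hypothetical. One step goes beyond the slide: the adjusted values are made monotone by taking a running minimum from the largest rank down, which is how the Benjamini-Hochberg adjustment is commonly implemented.

```python
def fdr_adjust(p_values):
    """Adjusted p-values p_i' = p_i * N / i, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda k: p_values[k])  # indices by ascending p
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted values
    # never increase as the rank decreases (assumed monotonicity step).
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical p-values for five SNPs.
p_values = [0.041, 0.001, 0.6, 0.008, 0.039]
adjusted = fdr_adjust(p_values)
alpha = 0.05
significant = [p for p, q in zip(p_values, adjusted) if q < alpha]
print(adjusted)
print(significant)
```

With these values only the two smallest p-values pass at alpha = 0.05, whereas an uncorrected threshold of 0.05 would have passed four of the five.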

BEYOND SINGLE LOCUS So far we have tested each SNP separately; however, recall our hypothesis that common diseases are influenced by common variants. Maybe considering two SNPs together will identify a stronger correlation with phenotype. Main problem: Number of pairs ~ $N^2$
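To make the scale of the pairwise problem concrete, here is a small arithmetic sketch. The panel size of one million SNPs echoes the microarray figure quoted earlier; applying a Bonferroni threshold at alpha = 0.05 to the pairs is an illustrative assumption.

```python
from math import comb

n_snps = 10**6
n_pairs = comb(n_snps, 2)  # N * (N - 1) / 2 unordered pairs
print(n_pairs)  # 499999500000

# Per-test threshold if every pair were tested with a Bonferroni correction.
alpha = 0.05
pair_threshold = alpha / n_pairs
print(pair_threshold)  # about 1e-13
```

Both the computing cost and the multiple-testing burden grow roughly with the square of the number of SNPs.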

BEYOND SINGLE LOCUS Further consider, in genotyping we may be using a Microarray (e.g. 0.5 – 1 Million SNPs) But there are many more sites in the human genome where variation may exist, will we then miss any causal variant outside the panel of ~1 Million? Not necessarily

BEYOND SINGLE LOCUS Two sites close to each other may vary in a highly correlated manner; this is Linkage Disequilibrium (LD). In this situation, a lack of recombination events has made the inheritance of those two sites dependent. If two such sites have high LD, then one site can serve as a proxy for the other.
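One standard way to quantify LD between two biallelic sites is r², computed from D = p(XY) − p(X)·p(Y). The slides do not say which LD measure they have in mind, so the choice of r², the function name and the haplotype data below are all assumptions. The sketch also assumes phased haplotypes and that both sites are polymorphic (otherwise the denominator is zero).

```python
def ld_r2(haplotypes, allele_x, allele_y):
    """r^2 between two sites, given phased haplotypes as (site X, site Y) pairs."""
    n = len(haplotypes)
    p_x = sum(h[0] == allele_x for h in haplotypes) / n
    p_y = sum(h[1] == allele_y for h in haplotypes) / n
    p_xy = sum(h == (allele_x, allele_y) for h in haplotypes) / n
    d = p_xy - p_x * p_y  # deviation from independent inheritance
    return d * d / (p_x * (1 - p_x) * p_y * (1 - p_y))

# Perfect LD: allele A at site X always travels with allele G at site Y.
linked = [("A", "G")] * 4 + [("T", "C")] * 4
# No LD: every combination of alleles is equally common.
unlinked = [("A", "G"), ("A", "C"), ("T", "G"), ("T", "C")] * 2

print(ld_r2(linked, "A", "G"))    # 1.0
print(ld_r2(unlinked, "A", "G"))  # 0.0
```

When r² is close to 1, genotyping one site is nearly as informative as genotyping both, which is what allows one site to stand in for the other.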

BEYOND SINGLE LOCUS So if sites X & Y have high LD, and X is in the Microarray, then knowing the allelic form of X informs the allelic form at Y In this way a reduced panel can represent a larger number (all?) of the common SNPs

BEYOND SINGLE LOCUS Impact of LD

BEYOND SINGLE LOCUS A problem is that if X correlates with a disease, the causal variant may be either X or Y

BEYOND SINGLE LOCUS Impact of Population Structure.

BEYOND SINGLE LOCUS In many cases, we are able to find SNPs that have a significant association with disease (see the GWAS Catalog). Yet the final predictive power (ability to predict disease from genotype) is limited for complex diseases (see “Finding the Missing Heritability of Complex Diseases”).

BEYOND SINGLE LOCUS Increasingly, whole-exome and even whole-genome sequencing are used for variant detection. Taking on the non-coding variants: use functional genomics data as a template. Network-based analysis rather than single-site or site-pair analysis. Complement GWAS with family-based studies.

FUNCTIONAL EFFECTS How do we predict how a variant is likely to be affecting protein function?

FUNCTIONAL EFFECTS Case: I found a SNP inside the coding sequence. Knowing how to translate the gene sequence to a protein sequence, I discovered that this is a non-synonymous change, i.e., the encoded amino acid changes. This is an nsSNP. Will that impact the protein’s function? (And I don’t quite know how the protein functions in the first place ...)

FUNCTIONAL EFFECTS Two popular approaches: PolyPhen 2.0: Adzhubei, I. A. et al. (2010). Nat Methods 7(4):248-249. SIFT: Kumar P. et al. (2009). Nat Protoc 4(7):1073-1081.

FUNCTIONAL EFFECTS PolyPhen 2.0

FUNCTIONAL EFFECTS The PolyPhen 2.0 pipeline uses existing data sets for training and later evaluation of target data, specifically the HumDiv database, which consists of: a compilation of all the damaging mutations with known effects on molecular function; and a collection of non-damaging differences between human proteins and those of closely related mammalian homologs.

FUNCTIONAL EFFECTS PolyPhen 2.0 features:

FUNCTIONAL EFFECTS A look at the Multiple Sequence Alignment (MSA) part of the PolyPhen 2.0 pipeline:

FUNCTIONAL EFFECTS Of interest is the Position-Specific Independent Counts (PSIC) score. This score reflects the amino acid’s frequency at the specific position in the sequence, given an MSA.

FUNCTIONAL EFFECTS Example:

FUNCTIONAL EFFECTS To derive the PSIC score we first calculate the frequency of each amino acid: $p(a, i) = \frac{n(a, i)^{\mathrm{eff}}}{\sum_b n(b, i)^{\mathrm{eff}}}$

FUNCTIONAL EFFECTS $p(a, i) = \frac{n(a, i)^{\mathrm{eff}}}{\sum_b n(b, i)^{\mathrm{eff}}}$ The idea: $n(a, i)^{\mathrm{eff}}$ is not the raw count of amino acid “$a$” at position $i$; rather, it is adjusted for the many closely related sequences in the MSA. The PSIC score of a SNP $a \to b$ at position $i$ is given by: $\mathrm{PSIC}(a \to b, i) \propto \ln \frac{p(b, i)}{p(a, i)}$
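The two formulas can be traced with a toy calculation. This sketch takes the effective counts as given: how PolyPhen 2.0 actually derives them from the MSA is not covered here, the counts below are invented, the function names are hypothetical, and the proportionality constant is set to 1. An amino acid with an effective count of zero would need some form of pseudo-count, which is also left out.

```python
from math import log

def position_frequencies(eff_counts):
    """p(a, i): effective count of each amino acid divided by the column total."""
    total = sum(eff_counts.values())
    return {aa: n / total for aa, n in eff_counts.items()}

def psic_score(eff_counts, ref, alt):
    """ln(p(alt, i) / p(ref, i)) for a substitution ref -> alt at one position."""
    p = position_frequencies(eff_counts)
    return log(p[alt] / p[ref])

# Hypothetical effective counts at one alignment column.
column = {"L": 8.0, "I": 1.5, "P": 0.5}
score_l_to_p = psic_score(column, "L", "P")
score_l_to_i = psic_score(column, "L", "I")
print(score_l_to_p, score_l_to_i)
```

In this toy column, L to P scores more negative than L to I, reflecting that proline is rarer than isoleucine at this position.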

FUNCTIONAL EFFECTS Ultimately your derived score can be compared with the existing scores from HumDiv

FUNCTIONAL EFFECTS Classification Naive Bayes method A type of classifier. Other classification algorithms include “Support Vector Machine”, “Decision Tree”, “Neural Net”, “Random Forest” etc. Sometimes called “Machine Learning” What is a classification algorithm? What is a Naive Bayes method/classifier?

FUNCTIONAL EFFECTS Training Data
Positive examples:
($x_{11}, x_{12}, \dots, x_{1n}$) +
($x_{21}, x_{22}, \dots, x_{2n}$) +
…
Negative examples:
($x_{i+1,1}, x_{i+1,2}, \dots, x_{i+1,n}$) -
($x_{i+2,1}, x_{i+2,2}, \dots, x_{i+2,n}$) -
Training Data → MODEL (“Supervised Learning”)

FUNCTIONAL EFFECTS Data Vector ($x_1, x_2, \dots, x_n$) → MODEL → Yes or No

FUNCTIONAL EFFECTS Naïve Bayes Classifier: Training Data → Bayesian Inference → + or −. Bayesian Inference: Expresses how a subjective assessment of likelihood should rationally change to account for evidence.
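A toy Naive Bayes classifier shows the idea of learning from labeled examples and then labeling a new data vector. This is a sketch under stated assumptions, not PolyPhen 2.0's implementation: features are binary, the function names and training data are hypothetical, and add-one (Laplace) smoothing is used so that no probability is exactly 0 or 1.

```python
from math import log

def train_naive_bayes(examples, labels):
    """Learn class priors and per-feature probabilities P(x_j = 1 | class)."""
    model = {}
    n_features = len(examples[0])
    for cls in set(labels):
        rows = [x for x, y in zip(examples, labels) if y == cls]
        prior = len(rows) / len(examples)
        # Add-one smoothing: (count of ones + 1) / (rows in class + 2).
        probs = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                 for j in range(n_features)]
        model[cls] = (prior, probs)
    return model

def predict(model, x):
    """Return the class with the highest posterior, treating features as independent."""
    def log_posterior(cls):
        prior, probs = model[cls]
        score = log(prior)
        for x_j, p_j in zip(x, probs):
            score += log(p_j if x_j else 1 - p_j)
        return score
    return max(model, key=log_posterior)

# Hypothetical training data: three binary features per example.
examples = [[1, 1, 0], [1, 0, 0], [1, 1, 1],   # positive examples
            [0, 0, 1], [0, 1, 1], [0, 0, 0]]   # negative examples
labels = ["+", "+", "+", "-", "-", "-"]
model = train_naive_bayes(examples, labels)
print(predict(model, [1, 1, 0]))  # +
print(predict(model, [0, 0, 1]))  # -
```

The "naive" part is the assumption that features are independent given the class, which is what lets the posterior be built by multiplying one probability per feature.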

FUNCTIONAL EFFECTS In statistics, frequentists and Bayesians often disagree. A frequentist is a person whose long-run ambition is to be wrong 5% of the time. A Bayesian is one who, vaguely expecting a horse, and catching a glimpse of a donkey, strongly believes he has seen a mule.

Or…

FUNCTIONAL EFFECTS Evaluating a classifier: Cross-validation PREDICT AND EVALUATE ON THESE TRAIN ON THESE FOLD 1

FUNCTIONAL EFFECTS Evaluating a classifier: Cross-validation PREDICT AND EVALUATE ON THESE FOLD 2

FUNCTIONAL EFFECTS Evaluating a classifier: Cross-validation PREDICT AND EVALUATE ON THESE FOLD k

FUNCTIONAL EFFECTS Evaluating a classifier: Cross-validation. Collect all evaluation results (from k “FOLD”s)
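The fold-by-fold procedure on these slides can be sketched as an index generator. The function name is hypothetical, and assigning samples to folds by striding (every k-th sample) is just one simple choice; in practice the samples are usually shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    folds = [list(range(start, n_samples, k)) for start in range(k)]
    for held_out in folds:
        held = set(held_out)
        # Train on everything except the held-out fold.
        train = [i for i in range(n_samples) if i not in held]
        yield train, held_out

splits = list(k_fold_indices(10, 5))
for fold_number, (train, test) in enumerate(splits, start=1):
    # In a real run: fit the classifier on `train`, predict and evaluate on `test`,
    # then collect the evaluation results from all k folds.
    print(fold_number, train, test)
```

Every sample is used for evaluation exactly once and never appears in the training set of its own fold.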

FUNCTIONAL EFFECTS Evaluating Classification Performance Wikipedia

FUNCTIONAL EFFECTS The Receiver Operating Characteristic (ROC) curve: true positive rate vs. false positive rate
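An ROC curve can be traced by sweeping a decision threshold over the classifier's scores. This is a minimal sketch with hypothetical function names and invented scores; it assumes all scores are distinct (ties would need to be grouped) and computes the area under the curve with the trapezoid rule.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points, threshold high to low."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, is_positive in ranked:
        if is_positive:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def area_under_curve(points):
    """Trapezoid-rule area under a curve given as (x, y) points sorted by x."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical classifier scores and true labels (True = damaging).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [True, True, False, True, False, False]
points = roc_points(scores, labels)
auc = area_under_curve(points)
print(points)
print(auc)
```

Here the area is 8/9: of the nine positive-negative pairs, eight have the positive example scored higher. A random classifier would give about 0.5 and a perfect one 1.0.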

FUNCTIONAL EFFECTS What about those SNPs outside the coding regions? Generally hard enough to predict within coding regions – regulatory sequences notoriously hard to pin down (see ENCODE controversy) One interesting new approach uses Support Vector Machine (SVM) classifiers to describe damage to cell-specific regulatory motif vocabularies.

FUNCTIONAL EFFECTS