Published by Samson Walters. Modified over 6 years ago.
1
LD Hub: a centralized database and web interface to perform LD score regression that maximizes the potential of summary level GWAS data for SNP heritability and genetic correlation analysis
Zheng et al., Feb 2017
Features:
(i) Calculate how inflated your GWAS results are due to confounding
(ii) Check genetic correlation between your trait of interest and (lots of) other traits
(iii) Calculate SNP heritability for your trait of interest
The following slides cover LD Hub, but also LD score regression itself.
Although published only a month ago, the paper has already been cited over 10 times, showing the potential of making work available as a preprint (e.g. on bioRxiv).
Background reading:
1- LD score regression distinguishes confounding from polygenicity in genome-wide association studies. Bulik-Sullivan et al, 2015, Nat Genet
2- Partitioning heritability by functional annotation using genome-wide association summary statistics. Finucane et al, Nat Genet
Journal club: 03/03/17, Mesut Erzurumluoglu
2
Population Genetics 101 Linkage disequilibrium (LD)
Non-random association of alleles at two or more loci (under random association, alleles at two loci are co-inherited 50% of the time)
For details, see my previous journal club slides on the ‘HAPRAP’ paper (Zheng et al, Feb 2017), dated 21/09/16
Haploview software
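As a minimal sketch of what "non-random association" means numerically, the standard LD coefficient D and the squared correlation r² can be computed from haplotype and allele frequencies. The function name and the example frequencies below are hypothetical, chosen only for illustration.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, r2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a, p_b: frequencies of allele A (locus 1) and allele B (locus 2)
    """
    d = p_ab - p_a * p_b  # D = 0 means the alleles associate at random
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r2


# Linkage equilibrium: haplotype frequency equals the product of allele frequencies.
d_eq, r2_eq = ld_stats(p_ab=0.25, p_a=0.5, p_b=0.5)

# Complete LD: allele A is always found on the same haplotype as allele B.
d_ld, r2_ld = ld_stats(p_ab=0.5, p_a=0.5, p_b=0.5)

print(d_eq, r2_eq)
print(d_ld, r2_ld)
```

The r² value is the quantity that LD scores sum over neighbouring SNPs.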
3
GWAS summary-level resource in the post-GWAS era
11 years of genome-wide association studies: >2000 GWAS
GWAS summary results are a valuable resource for methods such as LD score regression, two-sample Mendelian randomization, fine mapping and imputation
It is time consuming and challenging to collect and centralize data, harmonize information and set up an automated analysis pipeline
4
Introduction to study
GWASs provide a powerful approach for identifying variants associated with complex human diseases/traits
Lots of publicly available GWAS summary results (not individual-level data)
Both polygenicity and confounding biases can inflate the distribution of test statistics in a GWAS
Distinguishing inflation caused by a true polygenic signal from inflation caused by bias is important, as there is strong evidence that polygenicity accounts for the majority of the inflation in test statistics in many GWASs of large sample size
LD score regression quantifies the contribution of each by examining the relationship between the test statistics and their ‘LD score’
The LD score regression intercept can be used to estimate a more accurate correction factor than genomic control (λ)
LD score regression can also be used to:
Estimate the SNP heritability of complex traits/diseases
Partition this value into functional categories (e.g. gene sets)
Estimate the genetic (and phenotypic) correlation between different phenotypes
LD Hub (ldsc.broadinstitute.org) provides a user-friendly interface to make use of all the above in a reliable and efficient manner
Polygenicity: many small genetic effects
Confounding biases: e.g. cryptic relatedness and population stratification
SNP heritability: total “narrow-sense” (additive) contribution of SNPs to a trait's heritability
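For reference, the genomic control factor λ mentioned above is conventionally the median observed chi-square statistic divided by the median of a 1-degree-of-freedom chi-square distribution (about 0.455). The sketch below assumes Z scores as input; the function names and the toy Z scores are hypothetical.

```python
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_control_lambda(z_scores):
    """Genomic inflation factor: median observed chi-square over its null median."""
    chi2 = [z * z for z in z_scores]
    return statistics.median(chi2) / CHI2_1DF_MEDIAN


def mean_chi2(z_scores):
    """Mean chi-square statistic across SNPs."""
    return sum(z * z for z in z_scores) / len(z_scores)


# Toy Z scores whose squares sit exactly at the null median, so lambda is about 1.
null_like = [0.6744898, -0.6744898, 0.6744898]
# The same Z scores uniformly inflated so that every chi-square is 1.2 times larger.
inflated = [z * 1.2 ** 0.5 for z in null_like]

print(genomic_control_lambda(null_like))
print(genomic_control_lambda(inflated))
```

λ alone cannot tell whether inflation comes from polygenicity or from confounding, which is the gap the LD score regression intercept is meant to fill.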
5
Theory: LD score regression (LDSR)
LD score of a SNP = sum of r² between that SNP and all SNPs within 1 cM (~1 Mb); currently computed using only HapMap 3 SNPs
cM: distance between SNPs for which the expected probability of a crossover event happening in a single generation is 1%
Basic idea: the more genetic variation a SNP tags, the higher the probability that it will tag a causal variant. In contrast, LD scores shouldn’t be correlated with population stratification
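The definition above can be sketched directly: for each SNP, sum the squared genotype correlation with every SNP inside the 1 cM window. This is a naive illustration on hypothetical toy genotypes; it counts the SNP itself (r² = 1) and applies no bias adjustment to r², both of which are assumptions and may differ from what the LDSC software does.

```python
def pearson_r2(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def ld_scores(genotypes, positions_cm, window_cm=1.0):
    """LD score per SNP: sum of r2 with all SNPs within window_cm (self included)."""
    scores = []
    for j, geno_j in enumerate(genotypes):
        total = 0.0
        for k, geno_k in enumerate(genotypes):
            if abs(positions_cm[j] - positions_cm[k]) <= window_cm:
                total += pearson_r2(geno_j, geno_k)
        scores.append(total)
    return scores


# Toy reference panel: 3 SNPs x 6 individuals, genotypes coded 0/1/2.
genotypes = [
    [0, 1, 2, 1, 0, 2],  # SNP 0
    [0, 1, 2, 1, 0, 2],  # SNP 1: perfectly correlated with SNP 0
    [2, 0, 1, 1, 0, 2],  # SNP 2: more than 1 cM away from the others
]
positions_cm = [0.0, 0.5, 5.0]

scores = ld_scores(genotypes, positions_cm)
print(scores)
```

SNPs 0 and 1 tag each other and so receive a higher score than the isolated SNP 2.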
6
LD Score regression
Variants that tag more of the genome will have higher LD scores.
Z score = beta / SE
Estimate LD scores from a reference panel
Regress chi-squared statistics on LD scores
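The two steps on this slide can be sketched as follows: convert each SNP's beta and SE to a Z score, square it to get a chi-square statistic, and regress those statistics on LD scores. This is a plain unweighted least-squares fit on hypothetical toy numbers; the weighting and standard-error estimation used by the LDSC software are not reproduced here.

```python
import math


def z_score(beta, se):
    """Z score of a SNP association: effect estimate divided by its standard error."""
    return beta / se


def ols_fit(x, y):
    """Unweighted least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical toy data, built so that chi-square = 1.0 + 0.02 * LD score exactly.
ld = [10.0, 50.0, 100.0, 200.0]
se = [0.01, 0.01, 0.01, 0.01]
beta = [s * math.sqrt(1.0 + 0.02 * l) for s, l in zip(se, ld)]

chi2 = [z_score(b, s) ** 2 for b, s in zip(beta, se)]
slope, intercept = ols_fit(ld, chi2)
print(slope, intercept)
```

In this toy setup the intercept of 1 corresponds to no confounding, and all inflation in the chi-square statistics scales with LD score.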
7
LD Score Regression Univariate analysis Bivariate analysis
Bivariate analysis: the top two panels show two different traits, e.g. bipolar disorder and schizophrenia. Instead of χ² (the squared Z score of one trait), the test statistic is the product of the two traits' Z scores (Z1 × Z2), which is then regressed on LD score.
Sample overlap did not bias the results: it is captured by the intercept, which reflects the phenotypic correlation weighted by sample overlap.
Univariate analysis: LD Hub regresses test statistics from genome-wide SNPs against their LD scores. The slope is proportional to the SNP heritability of the trait (slope = N × h² / M for N samples and M SNPs). The intercept minus one is an estimator of the mean contribution of confounding to the inflation of the test statistics, and is a more accurate measure of test statistic inflation than genomic control.
In the bivariate setting, LDSC regresses the product of the Z scores from two traits against the LD scores. The slope is proportional to the genetic covariance between the two traits, from which the genetic correlation is derived, and the intercept protects the estimate against sample overlap between the two GWASs.
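To make the link between regression slopes and the reported quantities concrete, the sketch below rescales slopes using the standard LD score regression relationships: univariate slope = N × h² / M, and bivariate slope = sqrt(N1 × N2) × genetic covariance / M. All function names and input numbers are hypothetical.

```python
import math


def snp_heritability(slope, n_samples, n_snps):
    """h2 from the univariate slope, using slope = N * h2 / M."""
    return slope * n_snps / n_samples


def genetic_covariance(slope, n1, n2, n_snps):
    """Genetic covariance from the bivariate slope, using slope = sqrt(N1*N2) * rho_g / M."""
    return slope * n_snps / math.sqrt(n1 * n2)


def genetic_correlation(rho_g, h2_1, h2_2):
    """Genetic correlation: genetic covariance scaled by both SNP heritabilities."""
    return rho_g / math.sqrt(h2_1 * h2_2)


def confounding_inflation(intercept):
    """Mean contribution of confounding: univariate intercept minus one."""
    return intercept - 1.0


# Hypothetical inputs: 1,000,000 SNPs and two GWASs of 50,000 samples each.
M, N1, N2 = 1_000_000, 50_000, 50_000
h2_trait1 = snp_heritability(0.02, N1, M)
h2_trait2 = snp_heritability(0.01, N2, M)
rho_g = genetic_covariance(0.007, N1, N2, M)
r_g = genetic_correlation(rho_g, h2_trait1, h2_trait2)
print(h2_trait1, h2_trait2, rho_g, r_g)
```

The genetic correlation is unitless, which is what makes it comparable across the many trait pairs in LD Hub.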
8
GREML v LD score regression
GREML: genomic-relatedness-based restricted maximum likelihood method
Main paper: “Common SNPs explain a large proportion of the heritability for human height” Yang et al, Nat Genet
Part of the well-known GCTA package
Requires individual-level (genotype) data, whereas the largest meta-analyses are conducted via summary statistics
Computationally expensive at large sample sizes (variance components method)
Run time depends on (i) sample size and (ii) number of traits analysed
Main finding of the paper: 45% of variance can be explained by considering all SNPs simultaneously; most of the heritability is not missing but has not previously been detected because the individual effects are too small to pass stringent significance tests.
9
LD Hub for LDSR
LD score regression analysis pipeline
Traditional:
Download/ask for data
Harmonise data formats
QC your GWAS summary results
QC other GWAS summary results
LD score regression analysis
Results
Repeat process for another trait comparison
Timings shown on the slide: a few hours to months; a few days to months; a few days to months
Reliable and replicable?
LD Hub:
Upload GWAS summary results
Select options (i.e. click a few buttons)
Results for up to 175 traits
Timings shown on the slide: a few minutes; a few minutes to a couple of hours
Results: (i) calculated lambda for LD score regression, (ii) calculated SNP heritability, (iii) calculated genetic correlation between up to 177 traits (as of Mar 2017) and your trait of interest
10
LD Hub overview
LD Hub Database: 219 publicly available GWAS traits
Test Center: On-the-fly LD score regression analysis pipeline
Lookup Center: Existing LD score regression results lookup
GWAShare Center: Summary data sharing & user contribution
LD Hub has four major components: a database of publicly available GWAS traits (177 in v1.1), a Test Center providing an automated LDSC analysis pipeline, a Lookup Center for existing LDSC results, and a GWAShare Center for sharing summary data with the community
11
LD Hub database (v1.1)
In LD Hub (v1.1), we selected high quality GWAS summary results of 177 unique traits, which included 23 diseases, 47 risk factors and 107 metabolites
219 traits now (v1.3.1)
12
Why these datasets?
Non-sex stratified
White European ancestry
Traditional GWAS array (n > 450k SNPs)
Large sample sizes (n > 5000)
Mean chi-square of test statistics > 1
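The inclusion criteria on this slide amount to a simple filter over study metadata. The sketch below encodes them as a predicate; the `GwasStudy` record, its field names and the example studies are all hypothetical and do not reflect LD Hub's internal data model.

```python
from dataclasses import dataclass


@dataclass
class GwasStudy:
    name: str
    sex_stratified: bool
    ancestry: str
    n_snps: int
    sample_size: int
    mean_chi2: float


def meets_criteria(study):
    """Apply the five dataset criteria listed on the slide."""
    return (
        not study.sex_stratified
        and study.ancestry == "European"
        and study.n_snps > 450_000
        and study.sample_size > 5_000
        and study.mean_chi2 > 1.0
    )


# Hypothetical candidate studies.
studies = [
    GwasStudy("trait_a", False, "European", 2_400_000, 120_000, 1.25),
    GwasStudy("trait_b", True, "European", 2_400_000, 60_000, 1.10),   # sex stratified
    GwasStudy("trait_c", False, "European", 300_000, 20_000, 1.08),    # too few SNPs
    GwasStudy("trait_d", False, "European", 1_000_000, 3_000, 1.01),   # sample too small
]

selected = [s.name for s in studies if meets_criteria(s)]
print(selected)
```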
13
LD Hub web interface ldsc.broadinstitute.org
14
GWAShare Center Data sharing “facilitator” rather than a repository
GWAShare Center for sharing GWAS summary resources (important)
LD Hub also makes you aware of what is publicly available
LD Hub is continually updated and always requesting new datasets
Users are encouraged to share their GWAS results with the community (see: Abraham M. 28/02/17. Don’t let useful data go to waste. Nature)
15
Lookup Center – SNP heritability
Comprehensive info on each study
Info on studies: download link, PMID, trait, lambda, LD score regression intercept, ethnicities, sample size, number of SNPs
H2: SNP heritability of trait
Z_H2: SNP heritability Z score
λ: Genomic inflation factor
Intercept: LD score regression intercept
16
Lookup Center – Genetic correlation
Comprehensive info for each trait comparison
Info on trait comparisons: genetic correlation (rG), P value, SNP heritability, phenotypic correlation (gcov_int)
rg: Genetic correlation
gcov_int: phenotypic correlation between two traits, taking into account the sample overlap between the two GWA studies (e.g. if there is no sample overlap, gcov_int will be near zero; if the two traits are measured in the same samples, gcov_int will equal the phenotypic correlation between the two traits)
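The two limiting cases described for gcov_int follow from the expected intercept of the bivariate regression, which is the phenotypic correlation multiplied by the number of overlapping samples and divided by sqrt(N1 × N2). A minimal sketch, with a hypothetical function name and made-up sample sizes:

```python
import math


def expected_gcov_intercept(pheno_corr, n_overlap, n1, n2):
    """Expected bivariate intercept: phenotypic correlation scaled by sample overlap."""
    return pheno_corr * n_overlap / math.sqrt(n1 * n2)


# No shared samples: the intercept is expected to be zero.
no_overlap = expected_gcov_intercept(pheno_corr=0.3, n_overlap=0, n1=10_000, n2=20_000)

# Both traits measured in the same 10,000 samples: the intercept equals the phenotypic correlation.
full_overlap = expected_gcov_intercept(pheno_corr=0.3, n_overlap=10_000, n1=10_000, n2=10_000)

# Partial overlap gives a value in between.
partial = expected_gcov_intercept(pheno_corr=0.3, n_overlap=5_000, n1=10_000, n2=10_000)

print(no_overlap, full_overlap, partial)
```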
17
Test Center
The Test Center allows users to upload their GWAS results, select GWASs of interest in the LD Hub database, click go and collect the results, which greatly increases the speed of LD score regression analysis.
19
Mean χ² below 1.02: GWAS trait is not polygenic enough
20
Automated QC
Once the file is uploaded, the following QC steps are performed automatically by LD Hub:
Filtering SNPs: keep MAF > 5%
Remove those absent in HapMap phase 3 and those with a 1000 Genomes EUR MAF < 5%
Remove SNPs with effective sample size < 0.67 × the 90th percentile of sample size
Remove indels and structural variants
Remove SNPs whose alleles do not match those in 1000 Genomes
Remove SNPs in the MHC region
Remove SNPs with chi-squared > 80 (i.e. large β)
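The QC steps listed above can be sketched as a chain of per-SNP filters. Everything below is illustrative: the `SnpRecord` fields are hypothetical, the MHC boundaries (chromosome 6, 26 to 34 Mb) are an assumed convention rather than something stated on the slide, the HapMap 3 and 1000 Genomes rule is read as "remove if either condition fails", and the 90th percentile uses a simple nearest-rank definition.

```python
import math
from dataclasses import dataclass

# Assumed MHC boundaries; not taken from the slide.
MHC_CHROM, MHC_START, MHC_END = 6, 26_000_000, 34_000_000


@dataclass
class SnpRecord:
    rsid: str
    chrom: int
    pos: int
    maf: float
    maf_1kg_eur: float
    in_hapmap3: bool
    is_indel_or_sv: bool
    alleles_match_1kg: bool
    n_eff: float
    chi2: float


def percentile_90(values):
    """90th percentile using the nearest-rank definition."""
    ordered = sorted(values)
    rank = math.ceil(0.9 * len(ordered))
    return ordered[rank - 1]


def passes_qc(snp, n_threshold):
    """True if the SNP survives every filter listed on the slide."""
    in_mhc = snp.chrom == MHC_CHROM and MHC_START <= snp.pos <= MHC_END
    return (
        snp.maf > 0.05
        and snp.in_hapmap3
        and snp.maf_1kg_eur >= 0.05
        and snp.n_eff >= n_threshold
        and not snp.is_indel_or_sv
        and snp.alleles_match_1kg
        and not in_mhc
        and snp.chi2 <= 80
    )


def run_qc(snps):
    """Filter SNPs; the sample-size cut-off is 0.67 x the 90th percentile of n_eff."""
    n_threshold = 0.67 * percentile_90([s.n_eff for s in snps])
    return [s for s in snps if passes_qc(s, n_threshold)]


# Hypothetical uploaded summary statistics.
snps = [
    SnpRecord("rs1", 1, 1_000, 0.20, 0.20, True, False, True, 10_000, 1.5),
    SnpRecord("rs2", 6, 30_000_000, 0.20, 0.20, True, False, True, 10_000, 2.0),  # MHC
    SnpRecord("rs3", 2, 5_000, 0.01, 0.20, True, False, True, 10_000, 0.8),       # low MAF
    SnpRecord("rs4", 3, 7_000, 0.30, 0.30, True, False, True, 5_000, 1.1),        # low N
    SnpRecord("rs5", 4, 9_000, 0.30, 0.30, True, False, True, 10_000, 120.0),     # chi2 > 80
]

kept = [s.rsid for s in run_qc(snps)]
print(kept)
```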
21
Results – LD hub reliability
LD Hub results (blue) compared against previous results (Bulik-Sullivan et al, 2015, Nat Genet)
Differences: discrepancies are due to new QC protocols, more recently published GWAS results and, where applicable, greater SNP coverage
22
LD Hub application – atopic dermatitis
Twin studies suggest that eczema has a heritability of ~80%
LD score regression calculates H2 to be 7.8%
This is the narrow-sense heritability captured by SNPs (i.e. SNP heritability)
Heterogeneity in EAGLE consortium’s atopic dermatitis cases
23
LD Hub application – atopic dermatitis (continued)
Well-known association between asthma and eczema replicated
Suggestion of correlation with other immune-mediated diseases
Follow up with larger studies
24
LD Hub application – Type 2 diabetes
Volcano plot shows clear clusters, e.g. weight-related traits, lipid-related traits and branched-chain amino acids (e.g. isoleucine, valine)
25
Genetic correlation between metabolites and CHD
Volcano plot of genetic correlation between metabolites and CHD
Interesting:
1- APOB does not cluster with the HDL subtypes
2- All VLDL, IDL and LDL subclasses were positively correlated with CHD
3- Most of the HDL subclasses were negatively correlated with CHD, except triglycerides in small HDL (S.HDL.TG), which were positively correlated with CHD
4- Another potential finding is that monounsaturated fatty acid (MUFA) is positively correlated with CHD
5- Some amino acids, e.g. valine, are positively correlated with CHD
26
Integrative analytical strategy of LD Hub and MR-Base
Two-step strategy: Coronary Heart Disease and blood lipids
Hypothesis generation using LD Hub
Hypothesis testing using MR-Base
Compare to observational results

Trait 1    Trait 2    Method       r(G)      SE       P value
HDL        CAD        LDSC - rG    -0.314    0.042    5.0 x 10^-14
LDL        CAD        LDSC - rG    0.221     0.051    1.4 x 10^-5

Exposure   Outcome    Method       Beta      SE       P value
HDL        CAD        MR - Egger   0.056     0.087    0.52
LDL        CAD        MR - Egger   0.443     0.061    1.12 x 10^-10

As a proof of principle, we tested the genetic correlation of LDL/HDL with CHD using LD Hub. HDL shows a negative correlation while LDL shows a positive correlation with CHD. We then tested the causality of these correlations using MR-Base. As expected, the correlation between HDL and CHD is not causal, whereas an increased LDL level causally increases CHD risk. Generating hypotheses first reduces unnecessary multiple testing in MR-Base.
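As a rough sanity check on a table of estimates and standard errors, a Z score and two-sided P value can be derived under a normal approximation (a Wald test). This is an assumption about how such P values might be obtained, not a description of what LD Hub or MR-Base actually compute, so values derived this way from rounded table entries need not match the reported ones.

```python
import math


def wald_test(estimate, se):
    """Return (z, two-sided p) for an estimate and its standard error, assuming normality."""
    z = estimate / se
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p


# Genetic correlation estimates and standard errors as given on the slide.
z_hdl, p_hdl = wald_test(-0.314, 0.042)
z_ldl, p_ldl = wald_test(0.221, 0.051)

print(z_hdl, p_hdl)
print(z_ldl, p_ldl)
```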
27
Three way comparison: metabolites causally correlated with CHD
All three agree in direction
Digging further into the causal relationships using MR-Base, interesting findings include:
1. APOB and MUFA are positively and potentially causally correlated with CHD
2. As expected, the HDL subclasses were not causally correlated with CHD, but triglycerides in small HDL are positively correlated with CHD
28
Discussion & Next steps
Growing the database
Pre-existing / newly emerging GWASs
Phenotypes in UK Biobank: 150K, 500K
Extending new methodology
Bivariate stratified LD score regression (stratify, rather than compare the whole genome)
Phenome-wide scan
Beyond genetics
Fine mapping
Annotations and enrichments
A global GWAS summary results database
GWAS data from East Asian and African populations
29
Limitations of LD score regression
Data collection, harmonisation and QC was a time-consuming task: solved with LD Hub
Substantial differences in ancestry between the reference panel and the GWAS sample lead to inconsistent LD patterns: hopefully will be solved by LD Hub as more GWASs in different ethnicities are published
Not robust for all traits (e.g. if H2 is low)
Small sample sizes are also a problem: use GCTA when n < 3,000
30
Conclusions
LD Hub:
Large GWAS summary statistics database with 200+ traits
Fast bivariate LD score regression analysis: ~2 hours for all traits
User friendly: click and collect
~340 million possible pair-wise correlations amongst multiple GWAS
Standardized approach which improves robustness
Can be used to generate hypotheses