Published by Philippa Heath. Modified over 8 years ago.
1
Understanding GWAS SNPs
Xiaole Shirley Liu
Stat 115/215
2
Pace of GWAS Studies
3
GWAS SNPs
Association ≠ causation
What’s the most likely causal SNP / gene in LD with the genotyped SNP?
Use functional genomics to identify the disease tissue of origin
What’s the SNP doing in non-coding regions? RSNPs
4
Use Literature & Pathway Information to Identify Putative Causal SNPs / Genes
5
Each Gene has an NCBI Page
6
Especially Bibliography
7
And Pathways
8
Literature Mining Terms
Corpus: Collection of documents. E.g. all papers in PubMed
Term frequency: Number of times a word appears in a document. E.g. “polymerase” appeared 41 times in a paper
Document frequency: Number of documents a word appears in. E.g. 1234x papers have the word “transcription”
Collection frequency: Total number of times a word appears in a corpus. E.g. “transcription” appeared 6789X times in all PubMed-indexed papers
Stop words: Words in the corpus that contribute little to meaning. E.g. to, is, an
Stemming: Group together different variations of the same word. E.g. activate vs. activated vs. activating
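The terms above can be illustrated on a tiny made-up corpus. This is a minimal sketch: the three "documents", the stop-word list, and the crude `stem` helper are all hypothetical stand-ins, not PubMed data or a real stemmer.

```python
from collections import Counter

# Toy corpus (hypothetical text, not real PubMed abstracts).
corpus = [
    "the polymerase binds to the promoter and the polymerase pauses",
    "transcription is activated by the polymerase",
    "the enhancer activates transcription of the gene",
]

# Small illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "to", "is", "an", "by", "of", "and"}


def stem(word):
    """Crude suffix stripping so activate/activated/activating share one stem.

    A real pipeline would use an established stemmer instead.
    """
    for suffix in ("ating", "ated", "ates", "ate"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def tokenize(doc):
    """Lowercase, drop stop words, and stem the remaining words."""
    return [stem(w) for w in doc.lower().split() if w not in STOP_WORDS]


docs = [tokenize(d) for d in corpus]

# Term frequency: counts of each term within one document.
term_freq = [Counter(d) for d in docs]
# Document frequency: number of documents that contain the term.
doc_freq = Counter(term for d in docs for term in set(d))
# Collection frequency: total occurrences of the term across the corpus.
coll_freq = Counter(term for d in docs for term in d)

print("tf of 'polymerase' in document 0:", term_freq[0]["polymerase"])
print("df of 'polymerase':", doc_freq["polymerase"])
print("cf of 'polymerase':", coll_freq["polymerase"])
```

Note that document frequency and collection frequency differ only when a term repeats within a single document, as "polymerase" does in the first toy document.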
9
Documents Represented as Vectors
A document is summarized as a vector of word counts.
Each dimension contains the number of times a word appears.
Can calculate similarity between two documents by comparing their vectors
“Our analysis includes comparison of amino acid environments with random control environments as well as with each of the other amino acid environments.”
acid 2
amino 2
analysis 1
comparison 1
control 1
environments 3
[…]
our 1
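As a rough illustration of the word-count vector, the quoted sentence can be counted with the Python standard library. The helper names `word_counts` and `to_vector` are hypothetical, and the tokenization (lowercasing, stripping punctuation, splitting on whitespace) is an assumed simplification.

```python
import string
from collections import Counter

sentence = (
    "Our analysis includes comparison of amino acid environments with "
    "random control environments as well as with each of the other "
    "amino acid environments."
)


def word_counts(text):
    """Lowercase the text, strip punctuation, and count each word."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return Counter(cleaned.split())


def to_vector(counts, vocabulary):
    """Lay the counts out in a fixed vocabulary order (0 for absent words)."""
    return [counts.get(term, 0) for term in vocabulary]


counts = word_counts(sentence)
vocabulary = sorted(counts)          # one dimension per distinct word
vector = to_vector(counts, vocabulary)

for term in ("acid", "amino", "analysis", "comparison", "control", "our"):
    print(term, counts[term])
```

To compare two documents, both would be projected onto the same shared vocabulary so that each dimension refers to the same word.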
10
Comparing Two Documents
Intuitive comparison between two papers: correlation coefficient of their word occurrence vectors
Correlation measures the strength of the linear relationship between two random variables
a = c(1, 3, 5, 1, 8, 20, 0, 0, 0, 3, 1)
b = c(2, 3, 4, 0, 10, 25, 1, 0, 2, 4, 3)
c = c(2, 0, 1, 10, 2, 4, 7, 1, 5, 0, 8)
cor(a, b)   # 0.985615, correlated
cor(b, c)   # -0.110328, not correlated
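For readers who want to see what `cor` computes, here is a Python sketch of the Pearson correlation coefficient applied to the same three vectors. The function name `pearson` is a hypothetical helper; the slide itself uses R's built-in `cor`.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)


# Word occurrence vectors from the slide.
a = [1, 3, 5, 1, 8, 20, 0, 0, 0, 3, 1]
b = [2, 3, 4, 0, 10, 25, 1, 0, 2, 4, 3]
c = [2, 0, 1, 10, 2, 4, 7, 1, 5, 0, 8]

print(round(pearson(a, b), 6))  # 0.985615
print(round(pearson(b, c), 6))  # -0.110328
```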
11
Term Weighting Considerations
Give different terms different weight
Global weight
–Document frequency
12
Term Weighting Considerations
Give different terms different weight
Global weight
–Document frequency: Fewer documents, more weight: log(N / df). E.g. progesterone vs gene
Local weight
–Term frequency
13
Term Weighting Considerations
Give different terms different weight
Global weight
–Document frequency: Fewer documents, more weight: log(N / df). E.g. progesterone vs gene
Local weight
–Term frequency: More frequent, more weight: log(1+tf). E.g. progesterone: 10 times in paper 1 vs 3 in paper 2
–Document length
14
Term Weighting Considerations
Give different terms different weight
Global weight
–Document frequency: Fewer documents, more weight: log(N / df). E.g. progesterone vs gene
Local weight
–Term frequency: More frequent, more weight: 1 + log(tf). E.g. progesterone: 10 times in paper 1 vs 3 in paper 2
–Document length: Less weight for longer document. E.g. paper 1 200 pages vs paper 2 3 pages
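The weighting formulas on this slide can be sketched as small functions. The corpus size and document frequencies below are invented for illustration, and the length adjustment is only one simple possibility, since the slides do not give a formula for it.

```python
import math


def global_weight(n_docs, doc_freq):
    """Global weight log(N / df): terms found in fewer documents score higher."""
    return math.log(n_docs / doc_freq)


def local_weight(term_freq):
    """Local weight 1 + log(tf) for tf > 0, and 0 for an absent term.

    The deck also shows log(1 + tf); this sketch uses 1 + log(tf).
    """
    return 1.0 + math.log(term_freq) if term_freq > 0 else 0.0


def length_adjusted_tf(term_freq, doc_length, avg_length):
    """Scale tf down for longer-than-average documents.

    Assumption: a simple linear rescaling chosen for illustration only.
    """
    return term_freq * avg_length / doc_length


# Hypothetical numbers: a corpus of 1,000,000 papers in which
# "progesterone" is rare and "gene" is common.
N_DOCS = 1_000_000
idf_progesterone = global_weight(N_DOCS, 1_000)
idf_gene = global_weight(N_DOCS, 500_000)

print("global weight, progesterone:", round(idf_progesterone, 3))
print("global weight, gene:", round(idf_gene, 3))
print("local weight, tf = 10:", round(local_weight(10), 3))
print("local weight, tf = 3:", round(local_weight(3), 3))
```

The logarithm dampens raw counts: a term appearing 10 times gets more weight than one appearing 3 times, but not more than three times as much.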
15
Evaluating Relatedness of Papers
Related Articles
–Similarity between two documents: sum over all terms of (local wt1 × local wt2 × global wt)
–Pre-computed related articles for each citation
–Rank ordered by relevance
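A minimal sketch of this scoring, assuming the similarity is a sum over terms shared by both documents and reusing the log(N / df) and 1 + log(tf) weights from the previous slide. The four "papers" and their term counts are hypothetical, and this is not presented as PubMed's actual implementation.

```python
import math
from collections import Counter


def local_weight(term_freq):
    """Local weight 1 + log(tf); 0 when the term is absent."""
    return 1.0 + math.log(term_freq) if term_freq > 0 else 0.0


def similarity(doc1, doc2, doc_freq, n_docs):
    """Sum of local wt1 * local wt2 * global wt over terms in both documents."""
    score = 0.0
    for term in doc1.keys() & doc2.keys():
        global_wt = math.log(n_docs / doc_freq[term])
        score += local_weight(doc1[term]) * local_weight(doc2[term]) * global_wt
    return score


# Hypothetical term counts for four papers.
papers = {
    "paper_A": Counter({"progesterone": 10, "receptor": 4, "gene": 6}),
    "paper_B": Counter({"progesterone": 3, "receptor": 2, "gene": 1}),
    "paper_C": Counter({"gene": 8, "polymerase": 5}),
    "paper_D": Counter({"kinase": 2, "polymerase": 1}),
}
n_docs = len(papers)
# Document frequency: in how many papers each term occurs.
doc_freq = Counter(term for counts in papers.values() for term in counts)

# Rank the other papers by relatedness to paper_A, most related first.
query = "paper_A"
ranked = sorted(
    (name for name in papers if name != query),
    key=lambda name: similarity(papers[query], papers[name], doc_freq, n_docs),
    reverse=True,
)
print(ranked)
```

Because the scores depend only on the corpus, they could be computed once and stored, which matches the slide's point about pre-computed related articles.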
16
GRAIL: Gene Relationships Across Implicated Loci
Raychaudhuri et al., PLoS Genetics 2009
17
GRAIL: Gene Relationships Across Implicated Loci
18
GRAIL: Gene Relationships Across Implicated Loci
19
GRAIL: Gene Relationships Across Implicated Loci
20
GRAIL on Height SNPs
21
GRAIL on Crohn’s Disease
Use literature / pathways to identify potential causal gene
Find likely reproducible SNP hits, and increase statistical power
22
GWAS SNPs
Association ≠ causation
What’s the most likely causal SNP / gene in LD with the genotyped SNP?
Use functional genomics to identify the disease tissue of origin
What’s the SNP doing in non-coding regions? RSNPs
23
Identifying Causal Cell-type for Complex Disease
E.g. Rheumatoid Arthritis (RA)
Many cell types implicated over the years, ranging from neutrophils and synoviocytes to all classes of lymphocytes!
It is difficult to establish causality for complex phenotypes in humans
Use expression data: Comprehensive and unbiased, publicly available
24
Immunological Genome Project
Start with a list of disease SNPs
Find genes near the SNP that are specifically expressed in a cell type
Identify cell types that have many such genes... more than expected by chance
25
Identifying Causal Cell-type for Complex Disease From Expression
Negative control: simulation from random set of SNPs
P-value: proportion of simulations exceeding the observed enrichment
Hu et al., American Journal of Human Genetics, 2011
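The simulation-based p-value can be sketched as follows. Everything here is a toy stand-in: the gene names, the set of cell-type-specific genes, and the choice to sample random genes directly (rather than random SNPs mapped to nearby genes) are assumptions made to keep the example short.

```python
import random


def count_specific(genes, specific_genes):
    """Number of genes in the list that are specifically expressed in the cell type."""
    return sum(1 for gene in genes if gene in specific_genes)


def empirical_p_value(observed, null_values):
    """Proportion of simulations whose statistic is at least the observed one.

    Using >= rather than > is the conservative reading of "exceeding".
    """
    exceed = sum(1 for value in null_values if value >= observed)
    return exceed / len(null_values)


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical universe of 1000 genes, 50 of them specific to one cell type.
all_genes = [f"gene{i}" for i in range(1000)]
specific_genes = set(all_genes[:50])

# Hypothetical genes near disease SNPs: 10 of the 20 are cell-type specific.
disease_genes = all_genes[:10] + all_genes[500:510]
observed = count_specific(disease_genes, specific_genes)

# Negative control: random gene sets of the same size.
n_simulations = 2000
null_counts = [
    count_specific(rng.sample(all_genes, len(disease_genes)), specific_genes)
    for _ in range(n_simulations)
]

p_value = empirical_p_value(observed, null_counts)
print("observed specific genes:", observed)
print("empirical p-value:", p_value)
```

When no simulation reaches the observed value, the result is usually reported as p < 1 / (number of simulations) rather than as exactly zero.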
28
GWAS SNPs
Association ≠ causation
What’s the most likely causal SNP / gene in LD with the genotyped SNP?
Use functional genomics to identify the disease tissue of origin
What’s the SNP doing in non-coding regions? eQTL and RSNPs
29
GWAS SNP Distribution
RSNP
30
eQTL
eQTL: use expression as phenotype
–Are there SNPs that are associated with expression changes?
–Heritable genetic variation for transcription levels
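Treating expression as the phenotype amounts to asking whether expression changes with genotype. A minimal sketch, assuming an additive model in which genotype is coded as the number of alternate alleles (0, 1, or 2) and fitting an ordinary least-squares line. The data are invented, and a real eQTL analysis would also test significance and adjust for covariates.

```python
def linear_fit(x, y):
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: alternate-allele count per individual and the
# expression level of a nearby gene in the same individuals.
genotypes = [0, 0, 0, 1, 1, 1, 2, 2, 2]
expression = [5.1, 4.9, 5.0, 6.0, 6.2, 5.8, 7.1, 6.9, 7.0]

slope, intercept = linear_fit(genotypes, expression)
print("expression change per alternate allele:", round(slope, 3))
print("expected expression at genotype 0:", round(intercept, 3))
```

A slope far from zero suggests the SNP is associated with expression of the gene; the same fit would be repeated for each SNP and gene pair of interest.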
31
RSNPs
A SNP influences TF binding, affecting downstream (disease-related) gene expression
32
eQTL and RSNPs
eQTL: use expression as phenotype
–Are there SNPs that are associated with expression changes?
–Heritable genetic variation for transcription levels
RSNP: regulatory SNP
–Much of the influential variation is located cis to the coding locus
–In humans, mouse, and maize, 35%-50% of the genetic basis for intraspecific differences in transcription level is cis to the coding locus (e.g. Morley et al. 2004; Schadt et al. 2003; Stranger et al. 2005; Cheung et al. 2005, etc.).
33
Huang et al., Nat Genet 2014
34
RSNPs from GWAS
Enriched in regulatory sequences (promoters and enhancers) that are identified through histone mark ChIP-seq or DNase-seq
Maurano et al., Science 2012
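One simple way to quantify such enrichment is to compare the fraction of SNPs that fall inside regulatory regions with the fraction of the genome those regions cover. The sketch below assumes a single toy chromosome, sorted non-overlapping half-open intervals, and invented coordinates. It is not the method of the cited study, which uses more careful background models.

```python
from bisect import bisect_right


def in_any_region(pos, starts, ends):
    """True if pos lies in one of the sorted, non-overlapping [start, end) regions."""
    i = bisect_right(starts, pos) - 1   # last region starting at or before pos
    return i >= 0 and pos < ends[i]


def fold_enrichment(snp_positions, regions, genome_length):
    """Fraction of SNPs inside regions divided by fraction of genome covered."""
    starts = [start for start, _ in regions]
    ends = [end for _, end in regions]
    hits = sum(in_any_region(pos, starts, ends) for pos in snp_positions)
    observed_fraction = hits / len(snp_positions)
    covered_fraction = sum(end - start for start, end in regions) / genome_length
    return observed_fraction / covered_fraction


# Hypothetical regulatory regions (e.g. DNase-seq peaks) on a 10 kb toy genome.
regions = [(100, 200), (1000, 1100), (5000, 5300)]
genome_length = 10_000
# Hypothetical GWAS SNP positions: three fall inside regions, two do not.
snp_positions = [150, 1050, 5100, 7000, 9000]

fold = fold_enrichment(snp_positions, regions, genome_length)
print("fold enrichment:", round(fold, 2))
```

A value near 1 would mean SNPs land in regulatory regions about as often as expected from coverage alone; values well above 1 indicate enrichment.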
35
Highest Correlated Genes of Distal DHSs Harboring GWAS Variants
36
Trans-Effect of Cis-SNPs
Three risk loci for ESR1, MYC, and KLF4
Effect on TF expression is small, but much stronger when looking at the expression of their downstream target genes
Li et al., Cell 2013
37
Useful Tools to Understand RSNPs
Identify putative TFs whose binding might be influenced by SNPs based on ENCODE ChIP-seq / DNase-seq data
38
Understanding GWAS SNPs
Association ≠ causation
Use literature and pathways to identify the putative causal SNP / gene in LD with the genotyped SNP
Use (cell-type specific) expression and epigenomics to:
–Identify the disease tissue of origin
–Identify regulatory SNPs that affect TF binding and influence the expression of important downstream disease genes
39
Acknowledgement
Soumya Raychaudhuri
Manolis Dermitzakis