Understanding GWAS SNPs Xiaole Shirley Liu Stat 115/215.

1 Understanding GWAS SNPs Xiaole Shirley Liu Stat 115/215

2 Pace of GWAS Studies

3 GWAS SNPs
Association ≠ causation
What’s the most likely causal SNP / gene in LD with the genotyped SNP?
Use functional genomics to identify the disease tissue of origin
What’s the SNP doing in non-coding regions? RSNPs

4 Use Literature & Pathway Information to Identify Putative Causal SNPs / Genes

5 Each Gene Has an NCBI Page

6 Especially Bibliography

7 And Pathways

8 Literature Mining Terms
Corpus: Collection of documents. E.g. all papers in PubMed
Term frequency: Number of times a word appears in a document. E.g. “polymerase” appeared 41 times in a paper
Document frequency: Number of documents a word appears in. E.g. 1234x papers contain the word “transcription”
Collection frequency: Total number of times a word appears in a corpus. E.g. “transcription” appeared 6789X times in all PubMed-indexed papers
Stop words: Words in the corpus that contribute little to meaning. E.g. to, is, an
Stemming: Group together different variations of the same word. E.g. activate vs. activated vs. activating

9 Documents Represented as Vectors
A document is summarized as a vector of word counts. Each dimension contains the number of times a word appears.
The similarity between two documents can be calculated by comparing their vectors.
Example: “Our analysis includes comparison of amino acid environments with random control environments as well as with each of the other amino acid environments.”
acid 2, amino 2, analysis 1, comparison 1, control 1, environments 3, […], our 1
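A minimal Python sketch of the word-count vector described on this slide. The function name `word_count_vector` and the tokenization rule (lowercase, letters only) are assumptions made for illustration; the slide does not say how words were split or whether stop words were removed.

```python
import re
from collections import Counter


def word_count_vector(text):
    """Return a Counter mapping each lowercased word to its number of occurrences."""
    # Assumption: a "word" is a run of ASCII letters; punctuation is ignored.
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)


sentence = (
    "Our analysis includes comparison of amino acid environments "
    "with random control environments as well as with each of "
    "the other amino acid environments."
)

counts = word_count_vector(sentence)

# Print the vector in alphabetical order, one dimension per word.
for word in sorted(counts):
    print(word, counts[word])
```

Stop words such as "of" and "as" are still counted here; removing them and stemming would be separate preprocessing steps.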

10 Comparing Two Documents
Intuitive comparison between two papers → correlation coefficient of their word occurrence vectors
Correlation measures the strength of the linear relationship between two random variables
a = c(1, 3, 5, 1, 8, 20, 0, 0, 0, 3, 1)
b = c(2, 3, 4, 0, 10, 25, 1, 0, 2, 4, 3)
c = c(2, 0, 1, 10, 2, 4, 7, 1, 5, 0, 8)
cor(a, b)   # 0.985615, correlated
cor(b, c)   # -0.110328, not correlated
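For readers not using R, the same comparison can be sketched in plain Python. The helper `pearson` below is written out by hand for illustration and is not a library call; it computes the Pearson correlation coefficient, which is the default method of R's `cor`.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares around the means.
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)


# Word occurrence vectors from the slide.
a = [1, 3, 5, 1, 8, 20, 0, 0, 0, 3, 1]
b = [2, 3, 4, 0, 10, 25, 1, 0, 2, 4, 3]
c = [2, 0, 1, 10, 2, 4, 7, 1, 5, 0, 8]

print(round(pearson(a, b), 6))  # close to 1: the two vectors are correlated
print(round(pearson(b, c), 6))  # close to 0: the two vectors are not correlated
```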

11 Term Weighting Considerations
Give different terms different weight
Global weight
–Document frequency

12 Term Weighting Considerations
Give different terms different weight
Global weight
–Document frequency: Fewer documents, more weight: log(N / df). E.g. progesterone vs gene
Local weight
–Term frequency

13 Term Weighting Considerations
Give different terms different weight
Global weight
–Document frequency: Fewer documents, more weight: log(N / df). E.g. progesterone vs gene
Local weight
–Term frequency: More frequent, more weight: log(1+tf). E.g. progesterone: 10 times in paper 1 vs 3 in paper 2
–Document length

14 Term Weighting Considerations
Give different terms different weight
Global weight
–Document frequency: Fewer documents, more weight: log(N / df). E.g. progesterone vs gene
Local weight
–Term frequency: More frequent, more weight: 1 + log(tf). E.g. progesterone: 10 times in paper 1 vs 3 in paper 2
–Document length: Less weight for longer document. E.g. paper 1: 200 pages vs paper 2: 3 pages
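The two weight formulas on this slide can be sketched as follows. The corpus size and document frequencies are made-up numbers chosen only to contrast a rare term with a common one, and the function names are hypothetical. The document-length adjustment is left out because the slide gives no formula for it.

```python
import math


def global_weight(n_docs, doc_freq):
    """Global (inverse document frequency) weight: log(N / df)."""
    return math.log(n_docs / doc_freq)


def local_weight(term_freq):
    """Local (term frequency) weight: 1 + log(tf), or 0 if the term is absent."""
    return 1 + math.log(term_freq) if term_freq > 0 else 0.0


# Hypothetical corpus statistics, for illustration only.
N_DOCS = 1000
DOC_FREQ = {"progesterone": 10, "gene": 500}

for term, df in DOC_FREQ.items():
    print(term, "global weight:", round(global_weight(N_DOCS, df), 3))

# "progesterone" appears 10 times in paper 1 and 3 times in paper 2.
print("paper 1 local weight:", round(local_weight(10), 3))
print("paper 2 local weight:", round(local_weight(3), 3))
```

The rare term receives the larger global weight, and the logarithm damps the local weight so that 10 occurrences count for more than 3 but not for three times as much.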

15 Evaluating Relatedness of Papers
Related Articles
–Similarity between two documents: Σ over all terms (local wt1 × local wt2 × global wt)
–Pre-computed related articles for each citation
–Rank ordered by relevance
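A small sketch of the similarity score and ranking described on this slide, using the local weight 1 + log(tf) and global weight log(N / df) from the term-weighting slides. All term counts, document frequencies, paper names, and function names are hypothetical. This illustrates the formula only; it is not the actual PubMed related-articles implementation.

```python
import math


def local_weight(tf):
    """Local weight of a term in one document: 1 + log(tf)."""
    return 1 + math.log(tf) if tf > 0 else 0.0


def global_weight(n_docs, df):
    """Global weight of a term across the corpus: log(N / df)."""
    return math.log(n_docs / df)


def similarity(tf1, tf2, doc_freq, n_docs):
    """Sum over shared terms of local wt1 * local wt2 * global wt."""
    score = 0.0
    # Terms missing from either document have local weight 0 and add nothing.
    for term in tf1.keys() & tf2.keys():
        score += (
            local_weight(tf1[term])
            * local_weight(tf2[term])
            * global_weight(n_docs, doc_freq[term])
        )
    return score


def rank_related(query_tf, corpus_tf, doc_freq, n_docs):
    """Return document ids ordered from most to least similar to the query."""
    scores = {
        doc_id: similarity(query_tf, tf, doc_freq, n_docs)
        for doc_id, tf in corpus_tf.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical data, for illustration only.
N_DOCS = 1000
DOC_FREQ = {"progesterone": 10, "gene": 500, "receptor": 50}
query = {"progesterone": 10, "gene": 5}
corpus = {
    "paper_A": {"progesterone": 3, "gene": 2},
    "paper_B": {"gene": 8, "receptor": 4},
}

ranking = rank_related(query, corpus, DOC_FREQ, N_DOCS)
print(ranking)
```

Sharing the rare term "progesterone" contributes far more to the score than sharing the common term "gene", so paper_A ranks first.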

16 GRAIL: Gene Relationships Across Implicated Loci (Raychaudhuri et al., PLoS Genetics 2009)

17 GRAIL: Gene Relationships Across Implicated Loci

18 GRAIL: Gene Relationships Across Implicated Loci

19 GRAIL: Gene Relationships Across Implicated Loci

20 GRAIL on Height SNPs

21 GRAIL on Crohn’s Disease
Use literature / pathways to identify potential causal genes
Find likely reproducible SNP hits, and increase statistical power

22 GWAS SNPs
Association ≠ causation
What’s the most likely causal SNP / gene in LD with the genotyped SNP?
Use functional genomics to identify the disease tissue of origin
What’s the SNP doing in non-coding regions? RSNPs

23 Identifying Causal Cell-type for Complex Disease
E.g. Rheumatoid Arthritis (RA)
Many cell types have been implicated over the years, ranging from neutrophils to synoviocytes to all classes of lymphocytes!
It is difficult to establish causality for complex phenotypes in humans
Use expression data: comprehensive and unbiased, publicly available

24 Immunological Genome Project
Start with a list of disease SNPs
Find genes near each SNP that are specifically expressed in a cell type
Identify cell types that have many such genes... more than expected by chance

25 Identifying Causal Cell-type for Complex Disease From Expression (Hu et al., American Journal of Human Genetics, 2011)
Negative control: simulation from random sets of SNPs
P-value: proportion of simulations exceeding the observed enrichment
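The simulation-based p-value on this slide can be sketched with a toy example. Everything below is hypothetical: the gene universe, the "specifically expressed" set, and the enrichment statistic (a simple count) are stand-ins, not the scoring used by Hu et al. Simulations that tie the observed value are counted as exceeding it, which is an assumption on the conservative side.

```python
import random


def enrichment(genes, specific_genes):
    """Toy enrichment statistic: how many genes are in the cell-type-specific set."""
    return sum(1 for g in genes if g in specific_genes)


def empirical_p_value(observed, simulated):
    """Proportion of simulated statistics at least as large as the observed one."""
    exceed = sum(1 for s in simulated if s >= observed)
    return exceed / len(simulated)


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical universe of 1000 genes; the first 100 are "specifically
# expressed" in the cell type being tested.
all_genes = [f"gene{i}" for i in range(1000)]
specific = set(all_genes[:100])

# Hypothetical genes near disease SNPs: 20 genes, 8 of them cell-type specific.
disease_genes = all_genes[:8] + all_genes[500:512]
observed = enrichment(disease_genes, specific)

# Negative control: random gene sets of the same size.
simulated = [
    enrichment(rng.sample(all_genes, len(disease_genes)), specific)
    for _ in range(2000)
]

p = empirical_p_value(observed, simulated)
print("observed enrichment:", observed)
print("empirical p-value:", p)
```

In a real analysis the random sets would be drawn at the SNP level and matched on properties such as gene density, which this sketch does not attempt.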

28 GWAS SNPs
Association ≠ causation
What’s the most likely causal SNP / gene in LD with the genotyped SNP?
Use functional genomics to identify the disease tissue of origin
What’s the SNP doing in non-coding regions? eQTL and RSNPs

29 GWAS SNP Distribution RSNP

30 eQTL
eQTL: use expression as phenotype
–Are there SNPs that are associated with expression changes?
–Heritable genetic variation for transcription levels
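Treating expression as the phenotype can be illustrated with a simple regression of expression level on genotype dosage (0, 1, or 2 copies of the alternate allele). The genotypes, expression values, and function name below are invented for illustration. A real eQTL analysis would also adjust for covariates and test many SNP-gene pairs, which this sketch does not do.

```python
import math


def linear_fit(x, y):
    """Ordinary least squares fit of y on x; returns (slope, intercept, r)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# Hypothetical data: genotype dosage at one SNP and expression of one gene
# measured in six individuals.
genotype = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2]

slope, intercept, r = linear_fit(genotype, expression)
print("effect per allele copy:", round(slope, 3))
print("baseline expression:", round(intercept, 3))
print("correlation:", round(r, 3))
```

A slope far from zero suggests the SNP is associated with expression of the gene; assessing significance would require a test statistic on top of this fit.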

31 RSNPs
A SNP influences TF binding, affecting downstream (disease-related) gene expression

32 eQTL and RSNPs
eQTL: use expression as phenotype
–Are there SNPs that are associated with expression changes?
–Heritable genetic variation for transcription levels
RSNP: regulatory SNP
–Much of the influential variation is located cis to the coding locus
–In humans, mice, and maize, 35%-50% of the genetic basis for intraspecific differences in transcription level is cis to the coding locus (e.g. Morley et al. 2004; Schadt et al. 2003; Stranger et al. 2005; Cheung et al. 2005, etc.)

33 Huang et al., Nat Genet 2014

34 RSNPs from GWAS (Maurano et al., Science 2012)
Enriched in regulatory sequences (promoters and enhancers) that are identified through histone mark ChIP-seq or DNase-seq

35 Highest Correlated Genes of Distal DHSs Harboring GWAS Variants

36 Trans-Effect of Cis-SNPs (Li et al., Cell 2013)
Three risk loci for ESR1, MYC, and KLF4
Effect on TF expression is small, but much stronger when looking at the expression of their downstream target genes

37 Useful Tools to Understand RSNPs
Identify putative TFs whose binding might be influenced by SNPs, based on ENCODE ChIP-seq / DNase-seq data
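The core operation behind such tools, checking whether a SNP falls inside a TF binding peak, can be sketched as below. The peak coordinates, SNP positions, TF labels, and function name are all hypothetical placeholders rather than ENCODE data. Intervals are assumed to be half-open (start inclusive, end exclusive), as in BED files.

```python
def tfs_overlapping_snp(snp, peaks_by_tf):
    """Return the sorted list of TFs with at least one peak covering the SNP."""
    chrom, pos = snp
    hits = []
    for tf, peaks in peaks_by_tf.items():
        # A peak covers the SNP if it is on the same chromosome and
        # start <= pos < end (half-open interval).
        if any(c == chrom and start <= pos < end for c, start, end in peaks):
            hits.append(tf)
    return sorted(hits)


# Hypothetical ChIP-seq peaks per TF: (chromosome, start, end).
peaks_by_tf = {
    "TF_A": [("chr1", 100, 200)],
    "TF_B": [("chr1", 150, 250), ("chr2", 10, 50)],
    "TF_C": [("chr3", 0, 100)],
}

# Hypothetical SNPs: id -> (chromosome, position).
snps = {
    "snpA": ("chr1", 160),
    "snpB": ("chr2", 50),
    "snpC": ("chr3", 99),
}

for snp_id, snp in snps.items():
    print(snp_id, tfs_overlapping_snp(snp, peaks_by_tf))
```

Overlap with a peak only nominates a TF; whether the SNP actually changes binding would need motif analysis or allele-specific binding data.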

38 Understanding GWAS SNPs
Association ≠ causation
Use literature and pathways to identify the putative causal SNP / gene in LD with the genotyped SNP
Use (cell-type specific) expression and epigenomics to:
–Identify the disease tissue of origin
–Identify regulatory SNPs that affect TF binding and influence the expression of important downstream disease genes

39 Acknowledgement
Soumya Raychaudhuri
Manolis Dermitzakis

