
Resolving membership in a study in shared aggregate genetics data David W. Craig, Ph.D. Investigator & Associate Director Neurogenomics Division



Presentation transcript:

1 Resolving membership in a study in shared aggregate genetics data
David W. Craig, Ph.D., Investigator & Associate Director, Neurogenomics Division
dcraig@tgen.org

2 Genome-wide Association Studies (Nature Reviews Genetics)
- Genome-wide Association Studies (GWAS) genotype millions of Single Nucleotide Polymorphisms (SNPs) across thousands of individuals.
- SNPs are typically biallelic and diploid, giving three possible genotypes:
  - CC / CT / TT
  - 00 / 01 / 11
- Due to ancestral meiotic recombination, SNPs are not independent of neighboring variants. They are often in linkage disequilibrium (LD).
- Because of LD, a SNP may be associated with disease through its correlation with a different, functional variant.
- Summary stats for a SNP across hundreds or thousands of individuals:
  - 33% C / 77% T for cases and 45% C / 55% T for controls
  - P = 10^-8
  - CC = 508 / CT = 250 / TT = 108
  - OR = 1.8
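As a small illustration of how such summary statistics relate to each other, the sketch below derives allele frequencies from the genotype counts quoted on the slide and computes an allelic odds ratio. The function names are made up for this example, and the case/control allele counts passed to the odds-ratio function are hypothetical, not taken from the talk.

```python
def allele_frequencies(n_cc, n_ct, n_tt):
    """Return (freq_C, freq_T) from genotype counts at a biallelic SNP."""
    total_alleles = 2 * (n_cc + n_ct + n_tt)  # each diploid person carries two alleles
    c_alleles = 2 * n_cc + n_ct               # CC contributes two C, CT contributes one
    t_alleles = 2 * n_tt + n_ct
    return c_alleles / total_alleles, t_alleles / total_alleles


def allelic_odds_ratio(case_c, case_t, control_c, control_t):
    """Odds of carrying C in cases divided by the odds of carrying C in controls."""
    return (case_c * control_t) / (case_t * control_c)


# Genotype counts from the slide: CC=508 / CT=250 / TT=108
freq_c, freq_t = allele_frequencies(508, 250, 108)
print(f"C: {freq_c:.3f}  T: {freq_t:.3f}")

# Hypothetical allele counts, chosen only to show the calculation
example_or = allelic_odds_ratio(case_c=600, case_t=400, control_c=450, control_t=550)
print(f"allelic OR: {example_or:.2f}")
```

The point of the example is that frequencies, genotype counts and odds ratios are all simple functions of the same underlying genotypes, which is why each of them can leak information about study participants.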

3 Resolving Identity from aggregate genetics data
- GWAS are expensive, requiring genotyping of thousands of individuals.
- They often require consortiums of consortiums.
- Sharing individual-level data was, and is, a challenge.
- Sharing meta-data is a reasonable option.
- In 2007, summary allele frequencies and genotype counts were routinely placed on the web for all SNPs.
- In 2008, after broad deliberation with the scientific community, we published a forensics paper showing that even with only crude estimates of allele frequency, one could still resolve individuals.
- "Resolve" is the term we purposely use. "Identify" has multiple meanings, particularly in the context of a GWAS study.

4 Example Aggregate Data

SNP         % A allele (~500 cases)   % A allele (~500 controls)
rs903252    25%                       26%
rs232323    15%                       15%
rs323555    29%                       29%
rs232343    73%                       75%
rs233432    21%                       22%
rs234312    5.1%                      5.1%
rs163232    3.1%                      2.8%
rs8392731   15%                       16%
rs238764    7.3%                      7.1%
rs383745    45%                       54%

Other SNP aggregate data types: genotypes, odds ratios, p-values, etc.

5 Visual example (SNP data as visualized)
Pixel values: AA = 1.0, AB = 0.5, BB = 0
250,000 pixels

6 Merge 96 independent data images equally

7 After merging, individual images still resolvable
Panels: No Adjustment | Auto Contrast & Smooth Filter
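The image analogy in slides 5 to 7 can be mimicked numerically. The sketch below is an illustration only, not the method from the talk: it simulates 96 "images" whose pixels take the values 1.0, 0.5 or 0, averages them, and then uses a simple covariance-style score to check whether one contributor stands out against a non-contributor. The pixel count (reduced from 250,000 to keep it fast), the frequency range, the seed and all names are assumptions.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

N_PIXELS = 10_000  # assumption: smaller than the slide's 250,000 for speed
N_IMAGES = 96

# Hypothetical per-pixel A-allele frequency in the background population
freqs = [random.uniform(0.1, 0.9) for _ in range(N_PIXELS)]


def random_image(freqs):
    """One 'image': each pixel is the fraction of A alleles (AA=1.0, AB=0.5, BB=0)."""
    return [(int(random.random() < p) + int(random.random() < p)) / 2 for p in freqs]


images = [random_image(freqs) for _ in range(N_IMAGES)]

# Merge all images equally: the per-pixel mean plays the role of pooled allele frequency
merged = [sum(column) / N_IMAGES for column in zip(*images)]


def membership_score(image, merged, freqs):
    """Sum over pixels of (image - expected) * (merged - expected)."""
    return sum((x - p) * (m - p) for x, m, p in zip(image, merged, freqs))


member_score = membership_score(images[0], merged, freqs)
outsider_score = membership_score(random_image(freqs), merged, freqs)
print(f"member: {member_score:.2f}  outsider: {outsider_score:.2f}")
```

Each contributor shifts the merged value by only 1/96 per pixel, but over many pixels those small shifts add up in a consistent direction, while for a non-contributor they cancel out.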

8 Conceptual Approach

SNP        Reference Data Set   Data Set in Question   Person of Interest   Directional score
rs903252   25%                  35%                    100%                 +10
rs232323   15%                  13%                    50%                  -2
rs323555   29%                  39%                    100%                 +10
rs232343   73%                  51%                    0%                   +22
rs233432   21%                  32%                    100%                 +11
rs234312   5%                   15%                    50%                  +10
rs163232   3%                   0%                     0%                   +3
…          …                    …                      …                    …
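The directional scores in this table appear to follow a simple rule: how far the person of interest is from the reference data set, minus how far they are from the data set in question. A positive score means the person sits closer to the data set in question. The sketch below reproduces the score column under that assumption; the column order (reference, data set in question, person) and the split of each SNP identifier from its percentages are inferred from the flattened slide text.

```python
# (SNP, reference %, data set in question %, person of interest %)
# Column assignment is an inference from the slide, not stated explicitly there.
rows = [
    ("rs903252", 25, 35, 100),
    ("rs232323", 15, 13, 50),
    ("rs323555", 29, 39, 100),
    ("rs232343", 73, 51, 0),
    ("rs233432", 21, 32, 100),
    ("rs234312", 5, 15, 50),
    ("rs163232", 3, 0, 0),
]


def directional_score(reference, in_question, person):
    """Distance to the reference minus distance to the data set in question."""
    return abs(person - reference) - abs(person - in_question)


scores = [directional_score(ref, mix, person) for _, ref, mix, person in rows]

for (snp, *_), score in zip(rows, scores):
    print(f"{snp}: {score:+d}")
```

All seven computed values match the score column on the slide, which supports this reading of the columns.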

9 Conceptual Approach (continued)

SNP        Reference Data Set   Data Set in Question   Person of Interest   Directional score
rs903252   25%                  35%                    100%                 +10
rs232323   15%                  13%                    50%                  -2
rs323555   29%                  39%                    100%                 +10
rs232343   73%                  51%                    0%                   +22
rs233432   21%                  32%                    100%                 +11
rs234312   5%                   15%                    50%                  +10
rs163232   3%                   0%                     0%                   +3
…          …                    …                      …                    …

Equations (one approach of many!!)
mean(D) = 9.1 (mean of the directional scores)
sd(D) = 7.4
s = 7 (number of SNPs)
T = mean(D) / ( sd(D) / √s )
3.2 = 9.1 / ( 7.4 / √7 )
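The arithmetic on this slide can be checked directly. The sketch below takes the seven directional scores from the table and computes the mean, the sample standard deviation and the resulting T value using only the Python standard library. The variable names are chosen for this example, and treating sd(D) as the sample (n - 1) standard deviation is an assumption that happens to reproduce the slide's 7.4.

```python
from math import sqrt
from statistics import mean, stdev

# Directional scores from the table on the slide
scores = [10, -2, 10, 22, 11, 10, 3]

s = len(scores)        # number of SNPs
mean_d = mean(scores)  # about 9.1
sd_d = stdev(scores)   # sample standard deviation (n - 1), about 7.4

# One-sample t statistic: is the mean directional score above zero?
t_stat = mean_d / (sd_d / sqrt(s))

print(f"mean(D) = {mean_d:.1f}, sd(D) = {sd_d:.1f}, s = {s}, T = {t_stat:.2f}")
```

With unrounded inputs T comes out just under 3.25, in line with the 3.2 shown on the slide. With only seven SNPs this is a toy case; the statistic grows with √s, which is why hundreds of thousands of SNPs make even tiny per-SNP shifts detectable.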

10 Resolving Individuals in Aggregate Data Sets

11 Results on pooled samples

12 Impact
- NIH policy was changed.
- Summary-level data is no longer freely available on the web in an unrestricted manner.
- Additional papers refined the math and described limitations.

13 Managing Risk
- Distributing results of studies on human subjects inherently increases the risk of a person being identifiable.
- Context is important. The concept of Positive Predictive Value (PPV) can provide a measure.
- PPV can also account for "at-risk" populations.
- We are currently working with NIH on guidance for measuring risk for a given dataset.
- The approaches leveraged a critical concept of directionality, specific to genotype data and frequency tables.
- P-values represent a fundamentally different data type with low information content.
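To make the PPV point concrete, the sketch below applies the standard Bayes-rule definition of positive predictive value to a membership test. The sensitivity, specificity and prior probabilities are hypothetical numbers picked for illustration; none of them come from the talk.

```python
def positive_predictive_value(sensitivity, specificity, prior):
    """Probability that a positive call is a true member, given the prior membership rate."""
    true_positives = sensitivity * prior
    false_positives = (1.0 - specificity) * (1.0 - prior)
    return true_positives / (true_positives + false_positives)


# Hypothetical test characteristics
SENSITIVITY = 0.99
SPECIFICITY = 0.999

# Hypothetical priors: someone drawn from the general population versus someone
# from an "at-risk" group already likely to have taken part in the study
ppv_general = positive_predictive_value(SENSITIVITY, SPECIFICITY, prior=1 / 100_000)
ppv_at_risk = positive_predictive_value(SENSITIVITY, SPECIFICITY, prior=1 / 100)

print(f"PPV, general population: {ppv_general:.4f}")
print(f"PPV, at-risk population: {ppv_at_risk:.4f}")
```

The same test gives very different answers depending on the prior, which is one way to read the slide's remark that context is important.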

14 A new era

15 The era of whole-genome sequencing is approaching
- SNPs are common variants, usually defined as having a frequency greater than 1%.
- Whole-genome sequencing and exome sequencing inherently measure rare variants.
- Rare variants can be highly informative, particularly in combination.
- Approaches need to be explored for summarizing results without revealing identity.
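A back-of-the-envelope calculation shows why combinations of rare variants are so informative. The sketch below estimates how many people in a population would be expected to carry a given set of rare variants, assuming the variants are independent and in Hardy-Weinberg proportions. Those assumptions, the allele frequencies and the population size are all hypothetical simplifications for illustration.

```python
def carrier_frequency(allele_freq):
    """Chance that a diploid individual carries at least one copy of the allele."""
    return 1.0 - (1.0 - allele_freq) ** 2


def expected_carriers(allele_freqs, population_size):
    """Expected number of people carrying every variant, assuming independence."""
    probability = 1.0
    for freq in allele_freqs:
        probability *= carrier_frequency(freq)
    return probability * population_size


POPULATION = 1_000_000  # hypothetical population size

one_variant = expected_carriers([0.005], POPULATION)
four_variants = expected_carriers([0.005] * 4, POPULATION)

print(f"carriers of one rare variant:   {one_variant:.0f}")
print(f"carriers of four rare variants: {four_variants:.4f}")
```

Under these assumptions a single variant at 0.5% frequency is shared by thousands of people, while a combination of four such variants is expected in far fewer than one person per million.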

16 Acknowledgements
Lab
- Jennifer Dinh
- Szabolcs Szelinger
- Holly Benson
- Meredith Sanchez-Castillo
- Brooke Hjelm
Informatics
- Nils Homer, Ph.D.
- Tyler Izatt
- Jessica Aldrich
- Alexis Christoforides
- Ahmet Kurdoglu
- James Long
- Shripad Sinari
Funding
- NINDS U24NS051872
- State of Arizona
- NHGRI U01HG005210
- This work: ENDGAME (NHLBI U01 HL086528)

17 Thank you

