High-dimensional genomic data, identifiability, and query-response
Haixu Tang
School of Informatics and Computing
Indiana University, Bloomington
“Big Data” in Personal Genomics
Genomics is a key component of personalized medicine
– Massive
    – Large research-oriented projects: 1000 genomes to 10^6
    – Genome sequencing for all newborns?
    – Open data projects, e.g., the Personal Genome Project (PGP)
– Heterogeneous
    – Genomic sequence (variations)
    – Constant, dynamic monitoring
        – Transcriptomics, proteomics, metabolomics, microbial communities, etc. (as demonstrated by iPOP)
Challenges in Personal Genomics
Personalized Healthcare
– Analysis: detection of markers for diagnosis and treatment (pharmacogenomics)
– Sharing: sharing patient data among health practitioners; searching for successful treatment on similar patients (“patient like me”)
Research (secondary)
– Analysis: discovery of markers
– Sharing: methodology development; validation of markers
Challenges: Speed, Storage, Scalability, Security
Solution: cloud, hybrid cloud, bring computing to the data!
Privacy Enhancing Technologies
Personalized Healthcare
– Analysis: detection of markers for diagnosis and treatment (pharmacogenomics)
– Sharing: sharing patient data among health practitioners; searching for successful treatment on similar patients (“patient like me”)
Research (secondary)
– Analysis: discovery of markers
– Sharing: methodology development; validation of markers
Cryptographic protocols: secure multiparty computation (SMC), homomorphic computation, functional encryption
Database security approaches: access control, query auditing, differential privacy
Ethics studies, informed consent, policy
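Of the database security approaches listed, differential privacy is the easiest to show in a few lines. The sketch below is an illustration only, not a method taken from these slides: it releases a count of minor-allele carriers with Laplace noise. The function names, the toy genotype list, and the choice of epsilon are all assumptions, and a production system would need a hardened noise sampler.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def noisy_carrier_count(genotypes, epsilon, rng):
    """Release the number of minor-allele carriers with Laplace noise.

    genotypes: minor-allele count per person (0, 1, or 2); hypothetical input.
    Adding or removing one person changes the count by at most 1, so the
    sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for g in genotypes if g > 0)
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(0)          # fixed seed only to make the demo repeatable
    cohort = [0, 1, 2, 0, 1]        # toy data: 3 carriers
    print(noisy_carrier_count(cohort, epsilon=1.0, rng=rng))
```

A smaller epsilon gives stronger privacy but a noisier answer, which is one concrete form of the security and utility trade-off.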
What is specific to genomic data?
Challenges
– Genome technologies evolve very fast!
– Genomic data are extremely high-dimensional
    – Millions of SNPs, easily identifiable
    – Balance between data security and utility
– Not only the data, but also analysis results need to be protected
    – Allele frequencies or test statistics (e.g., Homer’s attack)
Special properties
– Different dimensions are NOT independent
    – Genetic structures (e.g., linkage disequilibrium)
– Specific genomic research focuses on a small number of dimensions (e.g., disease-associated SNPs)
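To make the risk from published allele frequencies concrete, here is a small simulation in the spirit of Homer's attack. The slide only names the attack; the distance statistic used here (sum over SNPs of |y - reference| - |y - pool|) is the commonly cited form, and the SNP count, pool size, independence of SNPs, and all function names are assumptions chosen for illustration.

```python
import random


def homer_statistic(person, pool_freqs, reference_freqs):
    """Sum over SNPs of |y - ref| - |y - pool|.

    person: one individual's per-SNP allele frequency (0.0, 0.5, or 1.0).
    A clearly positive value means the person sits closer to the pool than
    to the reference population, which suggests membership in the pool.
    """
    return sum(abs(y - r) - abs(y - m)
               for y, m, r in zip(person, pool_freqs, reference_freqs))


def simulate(num_snps=20000, pool_size=20, seed=0):
    """Return the statistic for one pool member and for one outsider."""
    rng = random.Random(seed)
    # Hypothetical population allele frequencies; SNPs are drawn independently.
    pop = [rng.uniform(0.1, 0.9) for _ in range(num_snps)]

    def draw_person():
        # Two alleles per SNP, each carrying the minor allele with probability p.
        return [(int(rng.random() < p) + int(rng.random() < p)) / 2 for p in pop]

    pool = [draw_person() for _ in range(pool_size)]
    reference = [draw_person() for _ in range(pool_size)]
    outsider = draw_person()

    # Only these aggregate frequencies would be published.
    pool_freqs = [sum(col) / pool_size for col in zip(*pool)]
    ref_freqs = [sum(col) / pool_size for col in zip(*reference)]

    return (homer_statistic(pool[0], pool_freqs, ref_freqs),
            homer_statistic(outsider, pool_freqs, ref_freqs))


if __name__ == "__main__":
    member_score, outsider_score = simulate()
    print("member:", member_score, "outsider:", outsider_score)
```

Each SNP contributes only a tiny signal, but summing over many SNPs separates the member from the outsider. That is why high dimensionality makes aggregate statistics identifying. Real SNPs are correlated through linkage disequilibrium, which this sketch ignores.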