Integrating Data for Analysis, Anonymization, and SHaring Supported by the NIH Grant U54HL108460 to the University of California, San Diego Shuang Wang,

Presentation transcript:

integrating Data for Analysis, Anonymization, and SHaring
Supported by the NIH Grant U54HL108460 to the University of California, San Diego

EXpectation Propagation LOgistic REgRession (EXPLORER): distributed privacy-preserving online model learning

Shuang Wang,1 Xiaoqian Jiang,1 Yuan Wu,1 Lijuan Cui,2 Samuel Cheng,2 and Lucila Ohno-Machado1
1 Division of Biomedical Informatics, University of California–San Diego, La Jolla, California, USA
2 School of Electrical and Computer Engineering, University of Oklahoma, Tulsa, Oklahoma, USA

Introduction

It has been shown in the last decade that data privacy cannot be maintained by simply removing patient identities. Thus, training data held at one institution cannot be exchanged or shared directly with other institutions for the purpose of learning a global logistic regression model. To address this challenge, numerous privacy-preserving distributed frequentist regression models for horizontally partitioned data have been studied, among which the Grid LOgistic REgression (GLORE) model [1] and the Secure Pooled Analysis acRoss K-site (SPARK) protocol [2] are the closest to the method presented here. Despite its simplicity and interpretability, the distributed frequentist logistic regression approach has limitations, as shown in Table 1.

Table 1: Comparing EXPLORER with GLORE and SPARK

                  Privacy protection   Asynchronous communication   Online learning
GLORE or SPARK    ✔
EXPLORER          ✔                    ✔                            ✔

EXPLORER framework

We developed an EXpectation Propagation LOgistic REgRession (EXPLORER) model for distributed privacy-preserving online learning [3]. The proposed framework provides a high-level guarantee for protecting sensitive information, since the information exchanged between the server and the client is the encrypted posterior distribution of the coefficients. In our experiments, EXPLORER shows the same performance as the traditional frequentist logistic regression model but provides more flexibility in model updating. That is, EXPLORER can be updated one point at a time rather than having to be retrained on the entire data set when new observations are recorded. EXPLORER also supports asynchronous communication, which relieves the participants from coordinating with one another and prevents service breakdown caused by absent participants or interrupted communications.

Methodology

Secured Intermediate iNformation Exchange (SINE) protocol

Experimental Results

Table 2: Summary of datasets used in our experiments

Dataset   Dataset description          # of covariates   # of samples
1         Simulated i.i.d. data        5                 500
2         Simulated correlated data    6                 500
3         Simulated binary data
4         Myocardial infarction        9                 1253

Table 3: Distributed forward feature selection on data set 1 over 30 trials
[Columns: β; Prob. for Ordinary LR and for EXPLORER; two-sample Z-test statistic and p-value. Cell values are not preserved in this transcript.]

Table 4: Comparisons of H-L tests and AUCs for simulated dataset 2 with/without interaction using Ordinary LR and 4-site EXPLORER
[Columns: Ordinary LR and EXPLORER, each with and without interaction. Rows: H-L test statistic and p-value; AUC averaged value, standard deviation, Z-test statistic, and Z-test p-value. Cell values are not preserved in this transcript.]

Figure: The convergence speed of all 10 coefficients of data set 4 for an asynchronous 8-site EXPLORER setup

Table 5: Learned model parameter β of dataset 3 using Ordinary LR and 2-site EXPLORER
[Columns: β; value and std. for Ordinary LR and for EXPLORER; two-sample Z-test statistic and p-value. Cell values are not preserved in this transcript.]

Summary of Conclusions

In summary, EXPLORER offers an alternative tool for privacy-preserving distributed statistical learning. We showed empirically on multiple data sets that the results are very similar to those of ordinary logistic regression. These promising results warrant further validation on larger data sets and further refinement of the methodology. The inability to openly share (i.e., transmit) patient data without onerous processes involving pair-wise agreements between institutions may significantly slow down analyses that could produce important results for healthcare improvement and biomedical research advances. EXPLORER provides a means to mitigate this problem by relying on multiparty computation, without the need for extensive re-training of models or reliance on synchronous communication among sites.

References

[1] Wu, Y., Jiang, X., Kim, J., & Ohno-Machado, L. (2012). Grid Binary LOgistic REgression (GLORE): building shared models without sharing data. JAMIA, 19(5).
[2] El Emam, K., Samet, S., Arbuckle, L., Tamblyn, R., Earle, C., & Kantarcioglu, M. (2013). A secure distributed logistic regression protocol for the detection of rare adverse drug events. JAMIA, 20(3).
[3] Wang, S., Jiang, X., Wu, Y., Cui, L., Cheng, S., & Ohno-Machado, L. (2013). EXpectation Propagation LOgistic REgRession (EXPLORER): Distributed privacy-preserving online model learning. JBI, 46(3).
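
The coefficient comparisons in the tables above rely on a two-sample Z-test between the Ordinary LR and EXPLORER estimates. The poster does not give the exact formula used, so the following is only a minimal sketch of the standard two-sided test, assuming the two estimates are independent and approximately normal. The function name and the numbers in the usage example are hypothetical and are not taken from the poster.

```python
import math


def two_sample_z(beta_a, se_a, beta_b, se_b):
    """Two-sided two-sample Z-test for the difference of two estimates.

    Assumption (not stated in the poster): the two estimates are
    independent and approximately normally distributed.
    """
    # Standard error of the difference between the two estimates.
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    z = (beta_a - beta_b) / se_diff
    # Two-sided p-value: P(|Z| >= |z|) = erfc(|z| / sqrt(2)) for standard normal Z.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical coefficient estimates and standard errors for illustration only.
    z, p = two_sample_z(beta_a=0.52, se_a=0.11, beta_b=0.50, se_b=0.12)
    print(f"z = {z:.3f}, p = {p:.3f}")
```

A large p-value from such a test would indicate no detectable difference between the two estimates, which is how the claim that EXPLORER results are "very similar" to ordinary logistic regression can be read.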