Using Data Privacy for Better Adaptive Predictions. Vitaly Feldman (IBM Research – Almaden), with Cynthia Dwork, Moritz Hardt, Omer Reingold, and Aaron Roth. Foundations of Learning Theory, 2014.


Using Data Privacy for Better Adaptive Predictions
Vitaly Feldman (IBM Research – Almaden)
Cynthia Dwork (MSR SVC), Moritz Hardt (IBM Almaden), Omer Reingold (MSR SVC), Aaron Roth (UPenn CS)
Foundations of Learning Theory, 2014

Statistical inference
Example: genome-wide association studies. Given: DNA sequences with medical records. Goals: find SNPs associated with diseases; predict the chance of developing some condition; predict drug effectiveness; hypothesis testing.

Existing approaches

The real world is interactive
Outcomes of analyses inform future manipulations on the same data: exploratory data analysis, model selection, feature selection, hyperparameter tuning. With public data, findings inform other analysts too. As a result, the samples are no longer i.i.d.!
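The danger is easy to reproduce. The sketch below (assuming NumPy; the dataset, labels, and "significance" threshold are invented for illustration) adaptively selects features by their empirical correlation with pure-noise labels and then evaluates the resulting classifier on the same data. Training accuracy looks impressive; fresh data reveals chance performance.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 1000  # n samples, d candidate features; labels are pure noise
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)

# Adaptive step 1: measure each feature's empirical correlation with y.
corr = X.T @ y / n

# Adaptive step 2: keep features whose correlation looks "significant"
# on the SAME data, then build a simple voting classifier from them.
selected = np.abs(corr) > 2 / np.sqrt(n)
scores = X[:, selected] @ np.sign(corr[selected])
train_acc = np.mean(np.sign(scores) == y)

# Fresh data from the same (label-independent) distribution.
X_test = rng.standard_normal((n, d))
y_test = rng.choice([-1.0, 1.0], size=n)
test_scores = X_test[:, selected] @ np.sign(corr[selected])
test_acc = np.mean(np.sign(test_scores) == y_test)

print(f"train accuracy: {train_acc:.2f}")  # far above chance
print(f"test accuracy:  {test_acc:.2f}")   # near chance (0.5)
```

Nothing here is malicious: each step is a reasonable-looking analysis, and the overfitting comes purely from reusing the data adaptively.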

Is the issue real? “Such practices can distort the significance levels of conventional statistical tests. The existence of this effect is well known, but its magnitude may come as a surprise, even to a hardened statistician.”

Kaggle competitions
The data is split into a public and a private part; during the competition, submissions receive a score on the public part (public score), while final standings use the withheld private part (private score). “If you based your model solely on the data which gave you constant feedback, you run the danger of a model that overfits to the specific noise in that data.” – Kaggle FAQ

Adaptive statistical queries
Learning algorithm(s) interact with an SQ oracle [K93, FGRVX13]: instead of raw samples, the algorithm asks for approximate expectations of functions of the data. SQs can measure error/performance and test hypotheses, and can be used in place of samples in most algorithms!
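As a minimal sketch of the SQ model (the oracle is simulated here by an empirical mean on a NumPy sample; the data and query are hypothetical):

```python
import numpy as np

def sq_oracle(data, query, tau):
    """Answer a statistical query: return a value within tolerance tau
    of the population mean of `query` over the data distribution.

    Here the oracle is modeled by the empirical mean on a dataset;
    with n >> log(1/delta) / tau**2 samples this is within tau of the
    true mean with high probability (for a single, non-adaptive query).
    """
    values = np.asarray([query(x) for x in data], dtype=float)
    assert np.all((0.0 <= values) & (values <= 1.0)), "queries map X -> [0, 1]"
    return values.mean()

# Usage sketch (hypothetical data and query): estimate P[x > 0].
rng = np.random.default_rng(1)
data = rng.standard_normal(10_000)
answer = sq_oracle(data, lambda x: float(x > 0), tau=0.05)
```

The point of the abstraction is that an algorithm phrased entirely in terms of such queries never touches individual samples, which is what makes the privacy-based analysis below applicable.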

SQ algorithms
PAC learning algorithms (except parities), convex optimization (ellipsoid, iterative methods), expectation maximization (EM), SVM (with kernels), PCA, ICA, ID3, k-means, method of moments, MCMC, naïve Bayes, neural networks (backprop), perceptron, nearest neighbors, boosting. [K93, BDMN05, CKLYBNO06, FPV14]

Naïve answering
Answer each query with its empirical mean on the dataset. For k queries fixed in advance, a Chernoff bound plus a union bound shows that n = O(log k / τ²) samples suffice to answer all of them within tolerance τ. This argument breaks down for adaptively chosen queries, since each query can depend on previous answers.
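For the non-adaptive case the bound can be made concrete; the small helper below (names are my own) computes the Hoeffding-plus-union-bound sample size:

```python
import math

def nonadaptive_sample_size(k, tau, delta=0.05):
    """Samples needed so that, by Hoeffding's inequality plus a union
    bound, all k FIXED [0, 1]-valued queries are simultaneously
    answered within tau with probability at least 1 - delta:
        2 * k * exp(-2 * n * tau**2) <= delta.
    """
    return math.ceil(math.log(2 * k / delta) / (2 * tau ** 2))

n = nonadaptive_sample_size(k=10_000, tau=0.05)
```

Note the logarithmic dependence on k: exponentially many fixed queries are cheap, which is exactly what adaptivity destroys.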

Our result

Fresh samples
If the dataset is analyzed differentially privately, then for the purpose of answering new queries the data behaves as if it were fresh samples.

Privacy-preserving data analysis
How to get utility from data while preserving the privacy of the individuals who contributed it.

Differential privacy [DMNS06]
Each sample point is created from the personal data of an individual, e.g. a genome paired with a disease status: (GTTCACG…TC, “YES”). A randomized algorithm M is ε-differentially private if for every pair of datasets S, S′ differing in a single element and every set of outcomes E, Pr[M(S) ∈ E] ≤ e^ε · Pr[M(S′) ∈ E].
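A standard way to achieve ε-DP for a bounded mean is the Laplace mechanism; the sketch below (NumPy-based, with hypothetical data) adds Laplace noise calibrated to the query's sensitivity:

```python
import numpy as np

def laplace_mean(data, epsilon, lo=0.0, hi=1.0, rng=None):
    """epsilon-differentially private mean of values in [lo, hi].

    Changing one of the n values moves the empirical mean by at most
    (hi - lo) / n (its sensitivity), so adding Laplace noise with
    scale sensitivity / epsilon gives epsilon-DP (Laplace mechanism).
    """
    rng = rng or np.random.default_rng()
    x = np.clip(np.asarray(data, dtype=float), lo, hi)
    sensitivity = (hi - lo) / len(x)
    return x.mean() + rng.laplace(scale=sensitivity / epsilon)

# Usage sketch: hypothetical values in [0, 1].
rng = np.random.default_rng(2)
data = rng.random(100_000)
private_mean = laplace_mean(data, epsilon=0.5, rng=rng)
```

For large n the added noise (scale 1/(εn) here) is far smaller than the sampling error itself, so privacy comes almost for free in terms of accuracy.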

Properties of DP
Post-processing: any function of a DP output remains DP. Composition: running several DP algorithms is still DP, with the privacy parameters adding up.

DP implies generalization
Answers computed with differential privacy cannot overfit the sample by much, and DP composition implies that DP algorithms can reuse the data adaptively.

Proof

Counting queries
Data analyst(s) issue counting queries (what fraction of the dataset satisfies a given predicate?) to a query release algorithm.

From private counting to SQs

Proof I

Proof II

Proof: moment bound

Corollaries [HR10]

MWU + sparse vector
Multiplicative weights update combined with the sparse vector technique, using Laplace noise.
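A minimal sketch of the sparse vector (AboveThreshold) idea, with the usual noise scales but an invented data/query setup; this is a simplification for illustration, not the exact mechanism from the talk:

```python
import numpy as np

def above_threshold(data, queries, threshold, epsilon, rng=None):
    """Sparse vector technique (AboveThreshold), minimal sketch.

    Processes a stream of queries whose empirical means have
    sensitivity at most 1/n, reporting only whether each (noisy)
    answer exceeds a (noisy) threshold, and halting after the first
    "above" report. The whole interaction is epsilon-DP.
    """
    rng = rng or np.random.default_rng()
    n = len(data)
    noisy_threshold = threshold + rng.laplace(scale=2 / (epsilon * n))
    for i, q in enumerate(queries):
        answer = np.mean([q(x) for x in data])
        if answer + rng.laplace(scale=4 / (epsilon * n)) >= noisy_threshold:
            return i  # index of the first query above threshold
    return None  # no query exceeded the threshold

# Usage sketch (hypothetical queries: only the second one is large).
data = np.ones(1_000)
queries = [lambda x: 0.0, lambda x: 1.0, lambda x: 0.0]
hit = above_threshold(data, queries, threshold=0.5, epsilon=1.0,
                      rng=np.random.default_rng(3))
```

The key property is that the privacy cost is paid only for "above" answers, so many below-threshold queries can be processed essentially for free.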

Threshold validation queries

Applications
The private SQ oracle can be plugged into the learning algorithm(s) listed earlier, making them safe to run adaptively on the same data.

Conclusions
Adaptive data manipulations can cause overfitting and false discovery. We give a theoretical model of the problem based on SQs. Using exact empirical means is risky. DP provably preserves the “freshness” of samples: adding noise can provably prevent overfitting. In applications, not all data must be used with DP.

Future work
THANKS!