Xintao Wu, Nov 19, 2015. Social Computing in Big Data Era – Privacy Preservation and Fairness Awareness



Drivers of Data Computing
6A’s: Anytime, Anywhere Access to Anything by Anyone Authorized
4V’s: Volume, Velocity, Variety, Veracity
Reliability, Security, Privacy, Usability

4V’s

AVC Denial Log Analysis
Volume and Velocity: 1 million log files per day, each with thousands of entries
S3, Hive and EMR on AWS

Social Media Customer Analytics
Network topology (friendship, followship, interaction)
Structured profile:
name  sex  age  disease  salary
Ada   F    18   cancer   25k
Bob   M    25   heart    110k
…
id  sex  age  address  income
5   F    Y    NC       25k
3   M    Y    SC       110k
Unstructured text (e.g., blog, tweet)
Retweet sequence
Transaction database
Product and review
Entity resolution, Patterns, Temporal/spatial, Scalability, Visualization, Sentiment, Privacy
Variety, Veracity: 10GB tweets per day
Belk and Lowe’s
UNCC Chancellor’s special fund

A Single View to the Customer
Diagram labels: Customer; Social Media; Gaming; Entertain; Banking; Finance; Our Known History; Purchase

Outline
- Introduction
- Privacy Preserving Social Network Analysis
  - Input perturbation
  - Output perturbation
- Anti-discrimination Learning

Privacy Breach Cases
- Nydia Velázquez (1994): medical record on her suicide attempt was disclosed
- AOL Search Log (2006): anonymized release of 650K users’ search histories lasted for less than 24 hours
- Netflix Contest (2009): $1M contest was cancelled due to privacy lawsuit
- 23andMe (2013): genetic testing was ordered to discontinue by FDA due to genetic privacy

Acxiom
Privacy
- In 2003, EPIC alleged that Acxiom provided consumer information to the US Army "to determine how information from public and private records might be analyzed to help defend military bases from attack."
- In 2013, Acxiom was among nine companies that the FTC investigated to see how they collect and use consumer data.
Security
- In 2003, more than 1.6 billion customer records were stolen during the transmission of information to and from Acxiom's clients.

Privacy Regulation (Forrester)
Map legend: Most restricted; Restricted; Some restrictions; Minimal restrictions; Effectively no restrictions; No legislation or no information

Privacy Protection Laws
USA
- HIPAA for health care
- Gramm-Leach-Bliley Act of 1999 for financial institutions
- COPPA for children's online privacy
- State regulations, e.g., California Senate Bill 1386
Canada
- PIPEDA: Personal Information Protection and Electronic Documents Act
European Union
- Directive 95/46/EC: provides guidelines for member state legislation and forbids sharing data with states that do not protect privacy
Contractual obligations
- Individuals should have notice about how their data is used and have opt-out choices

Privacy Preserving Data Mining
ssn  name  zip    race   …  age  sex  income  …  disease
           28223  Asian  …  20   M    85k     …  Cancer
           28223  Asian  …  30   F    70k     …  Flu
           28262  Black  …  20   M    120k    …  Heart
           28261  White  …  26   M    23k     …  Cancer
           …
                  Asian  …  20   M    110k    …  Flu
69% unique on zip and birth date; 87% with zip, birth date and gender
Generalization (k-anonymity, l-diversity, t-closeness)
Randomization
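
To make the generalization idea on this slide concrete, here is a minimal sketch that measures k-anonymity before and after coarsening the quasi-identifiers. The four rows come from the slide's table; the choice of quasi-identifiers (zip and age), the 3-digit zip prefix and the 20-year age buckets are assumptions made for illustration, not the author's method.

```python
from collections import Counter

# Rows taken from the slide's table (ssn and name are already suppressed there).
records = [
    {"zip": "28223", "race": "Asian", "age": 20, "sex": "M", "disease": "Cancer"},
    {"zip": "28223", "race": "Asian", "age": 30, "sex": "F", "disease": "Flu"},
    {"zip": "28262", "race": "Black", "age": 20, "sex": "M", "disease": "Heart"},
    {"zip": "28261", "race": "White", "age": 26, "sex": "M", "disease": "Cancer"},
]

def k_anonymity(rows, quasi_identifiers):
    """Size of the smallest group of rows sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return min(groups.values())

def generalize(row):
    """Hypothetical generalization: keep a 3-digit zip prefix, bucket age in 20-year bands."""
    low = row["age"] // 20 * 20
    coarse = dict(row)
    coarse["zip"] = row["zip"][:3] + "**"
    coarse["age"] = f"{low}-{low + 19}"
    return coarse

QI = ["zip", "age"]  # assumed quasi-identifiers
generalized = [generalize(r) for r in records]

k_before = k_anonymity(records, QI)      # every row is unique on (zip, age)
k_after = k_anonymity(generalized, QI)   # all rows fall into one group
print(k_before, k_after)
```

With these assumed buckets every row is unique before generalization and indistinguishable from the other three afterwards, at the cost of coarser zip and age values.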

Privacy Preserving Data Mining

Social Network Data
Data owner → release → Data miner
Original data:
name    sex  age  disease  salary
Ada     F    18   cancer   25k
Bob     M    25   heart    110k
Cathy   F    20   cancer   70k
Dell    M    65   flu      65k
Ed      M    60   cancer   300k
Fred    M    24   flu      20k
George  M    22   cancer   45k
Harry   M    40   flu      95k
Irene   F    45   heart    70k
Released data:
id  sex  age  disease  salary
5   F    Y    cancer   25k
3   M    Y    heart    110k
6   F    Y    cancer   70k
1   M    O    flu      65k
7   M    O    cancer   300k
2   M    Y    flu      20k
9   M    Y    cancer   45k
4   M    M    flu      95k
8   F    M    heart    70k

Threat of Re-identification
Released data:
id  sex  age  disease  salary
5   F    Y    cancer   25k
3   M    Y    heart    110k
6   F    Y    cancer   70k
1   M    O    flu      65k
7   M    O    cancer   300k
2   M    Y    flu      20k
9   M    Y    cancer   45k
4   M    M    flu      95k
8   F    M    heart    70k
Attacker → attack → released data
Privacy breaches:
- Identity disclosure
- Link disclosure
- Attribute disclosure
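
As an illustration of identity and attribute disclosure, the following sketch matches an attacker's background knowledge against the released table from the slide. The background knowledge used here (the target is an older man earning 300k) and the helper name `candidates` are hypothetical.

```python
# Released table from the slide: (id, sex, age group, disease, salary).
FIELDS = ("id", "sex", "age", "disease", "salary")
released = [
    (5, "F", "Y", "cancer", "25k"),
    (3, "M", "Y", "heart", "110k"),
    (6, "F", "Y", "cancer", "70k"),
    (1, "M", "O", "flu", "65k"),
    (7, "M", "O", "cancer", "300k"),
    (2, "M", "Y", "flu", "20k"),
    (9, "M", "Y", "cancer", "45k"),
    (4, "M", "M", "flu", "95k"),
    (8, "F", "M", "heart", "70k"),
]

def candidates(rows, **known):
    """Return the released rows consistent with what the attacker already knows."""
    index = {name: i for i, name in enumerate(FIELDS)}
    return [r for r in rows if all(r[index[k]] == v for k, v in known.items())]

# Coarse knowledge narrows the target down to two rows.
coarse = candidates(released, sex="M", age="O")
# One extra fact pins the target to a single row (identity disclosure) ...
exact = candidates(released, sex="M", age="O", salary="300k")
# ... and reveals the sensitive value (attribute disclosure).
leaked_disease = exact[0][FIELDS.index("disease")]
print(len(coarse), len(exact), leaked_disease)
```

Replacing names with ids and coarsening age does not help once the remaining attributes are unique for the target.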

Privacy Preservation in Social Network Analysis
Input Perturbation
- K-anonymity
- Generalization
- Randomization

Our Work
- Feature preservation randomization
  - Spectrum preserving randomization (SDM08)
  - Markov chain based feature preserving randomization (SDM09)
- Reconstruction from randomized graph (SDM10)
- Link privacy (from the attacker perspective)
  - Exploiting node similarity feature (PAKDD09 Best Student Paper Runner-up Award)
  - Exploiting graph space via Markov chain (SDM09)

Spectrum Preserving Randomization [SDM08]
Spectral Switch:
- To increase the eigenvalue:
- To decrease the eigenvalue:

Reconstruction from Randomized Graph [SDM10]
We can reconstruct a graph from such that w/o incurring much privacy loss

Exploiting graph space [SDM09]
Figure label: Original

PSNet (NSF )

Output Perturbation
Data miner → Query f → Data owner
Data owner → Query result + noise → Data miner
name    sex  age  disease  salary
Ada     F    18   cancer   25k
Bob     M    25   heart    110k
Cathy   F    20   cancer   70k
Dell    M    65   flu      65k
Ed      M    60   cancer   300k
Fred    M    24   flu      20k
George  M    22   cancer   45k
Harry   M    40   flu      95k
Irene   F    45   heart    70k
The noisy result cannot be used to derive whether any individual is included in the database

Differential Guarantee [Dwork, TCC06]
name   disease
Ada    cancer
Bob    heart
Cathy  cancer
Dell   flu
Ed     cancer
Fred   flu
Table x: f = count(#cancer), mechanism K returns f(x) + noise = 3 + noise
Table x' (the same table with one cancer record opted out): f = count(#cancer), mechanism K returns f(x') + noise = 2 + noise
achieving Opt-Out

Differential Privacy
ε is a privacy parameter: smaller ε = stronger privacy
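
For reference, the standard statement of ε-differential privacy from the work cited on the previous slide can be written as below. The symbols K (the randomized mechanism), x and x' (data sets differing in one record) and S (a set of outputs) are the usual notation and are assumed here to match what the slide displays.

```latex
\[
\Pr[K(x) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[K(x') \in S]
\quad \text{for all neighboring } x, x' \text{ and all } S \subseteq \mathrm{Range}(K)
\]
```

For small ε the two output distributions are nearly identical, so an observer learns almost nothing about whether any one record is present.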

Calibrating Noise
Laplace distribution
Sensitivity of function: global sensitivity, local sensitivity
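
The quantities named on this slide are usually defined as follows; this is a sketch in standard notation (GS for global sensitivity, LS for local sensitivity, Lap(b) for a zero-mean Laplace distribution with scale b), which is assumed rather than taken from the slide.

```latex
\[
GS_f = \max_{x,\,x' \text{ neighbors}} \lVert f(x) - f(x') \rVert_1,
\qquad
LS_f(x) = \max_{x' \text{ neighbor of } x} \lVert f(x) - f(x') \rVert_1
\]
\[
K(x) = f(x) + \mathrm{Lap}\!\left(\frac{GS_f}{\varepsilon}\right),
\qquad
\mathrm{Lap}(b):\; p(z) = \frac{1}{2b}\exp\!\left(-\frac{|z|}{b}\right)
\]
```

Global sensitivity is a worst case over all data sets, while local sensitivity depends on the data set at hand and is never larger.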

Sensitivity
name    sex  age  disease  salary
Ada     F    18   cancer   25k
Bob     M    25   heart    110k
Cathy   F    20   cancer   70k
Dell    M    65   flu      65k
Ed      M    60   cancer   300k
Fred    M    24   flu      20k
George  M    22   cancer   45k
Harry   M    40   flu      95k
Irene   F    45   heart    70k
Function f       Sensitivity
Count(#cancer)   1
Sum(salary)      u (domain upper bound)
Avg(salary)      u/n
Data mining tasks can be decomposed into a sequence of simple functions.
L1 distance is used for vector output.
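
A minimal sketch of answering the count and sum queries from this slide with Laplace noise scaled to sensitivity/ε. The salary bound `U = 300` (in thousands) is an assumed public domain bound, and the helper names are hypothetical; the noise is sampled as the difference of two exponential variates, which follows a Laplace distribution.

```python
import random

# Table from the slide: (name, disease, salary in thousands).
table = [
    ("Ada", "cancer", 25), ("Bob", "heart", 110), ("Cathy", "cancer", 70),
    ("Dell", "flu", 65), ("Ed", "cancer", 300), ("Fred", "flu", 20),
    ("George", "cancer", 45), ("Harry", "flu", 95), ("Irene", "heart", 70),
]

def laplace_noise(scale, rng):
    """Zero-mean Laplace sample: the difference of two Exp(1/scale) variates."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Return the true answer plus noise drawn from Lap(sensitivity / epsilon)."""
    return true_value + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)   # fixed seed only to make the sketch reproducible
epsilon = 0.5            # assumed privacy budget per query
U = 300                  # assumed public upper bound on salary (thousands)

true_count = sum(1 for _, disease, _ in table if disease == "cancer")
true_sum = sum(salary for _, _, salary in table)

noisy_count = laplace_mechanism(true_count, 1, epsilon, rng)   # sensitivity 1
noisy_sum = laplace_mechanism(true_sum, U, epsilon, rng)       # sensitivity u
print(true_count, noisy_count)
print(true_sum, noisy_sum)
```

The sum query needs noise 300 times larger than the count query at the same ε, which is why bounding the domain matters.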

Challenge in OSN
Degree sequence: [1,1,3,3,3,3,2] vs. [1,1,3,3,2,2,2]; ΔD = 2, noise from Lap(2/ε) is needed
Number of triangles: n-2 vs. 0; Δ = n-2, huge noise is needed
High sensitivity!
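
The two sensitivities on this slide can be checked with a small sketch. The worst-case graph built here (two hub nodes each linked to all other nodes but not to each other) is an assumed construction that reproduces the slide's n-2 versus 0 triangle counts; the degree sequences are the ones shown on the slide.

```python
from itertools import combinations

def count_triangles(n, edges):
    """Count triangles in an undirected graph on nodes 0..n-1."""
    adj = {v: set() for v in range(n)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return sum(
        1
        for a, b, c in combinations(range(n), 3)
        if b in adj[a] and c in adj[a] and c in adj[b]
    )

def worst_case_graph(n):
    """Nodes 0 and 1 are each linked to the other n-2 nodes, but not to each other."""
    return [(hub, v) for hub in (0, 1) for v in range(2, n)]

n = 7
edges = worst_case_graph(n)
without_edge = count_triangles(n, edges)            # no triangles at all
with_edge = count_triangles(n, edges + [(0, 1)])    # one triangle per remaining node

# Degree sequences from the slide: one edge changes two degrees by 1 each.
d1 = [1, 1, 3, 3, 3, 3, 2]
d2 = [1, 1, 3, 3, 2, 2, 2]
l1_distance = sum(abs(a - b) for a, b in zip(d1, d2))
print(without_edge, with_edge, l1_distance)
```

A single edge changes the degree sequence by a constant but the triangle count by n-2, so Laplace noise for triangle counting grows with the size of the network.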

Advanced Mechanisms
Possible theoretical approaches
- Smooth sensitivity
- Exponential mechanism
- Functional mechanism
- Sampling

Our Work
- DP-preserving cluster coefficient (ASONAM12)
- DP-preserving spectral graph analysis (PAKDD13)
- Linear-refinement of DP-preserving query answering (PAKDD13 Best Application Paper)
- DP-preserving graph generation based on degree correlation (TDP13)
- Regression model fitting under differential privacy and model inversion attack (IJCAI 15)
- DP-preservation for deep auto-encoders (AAAI 16)

SMASH (NIH R01GM103309)

Genetic Privacy (NSF and )
BIBM13 Best Paper Award

Outline
- Introduction
- Privacy Preserving Social Network Analysis
  - Input perturbation
  - Output perturbation
- Anti-discrimination Learning

What is discrimination?
- Discrimination refers to unjustified distinctions of individuals based on their membership in a certain group.
- Federal laws and regulations disallow discrimination on several grounds:
  - Gender, Age, Marital Status, Sexual Orientation, Race, Religion or Belief, Disability or Illness, …
- These attributes are referred to as the protected attributes; the groups they define are the protected groups.

Predictive Learning
- Finding evidence of discrimination
- Building non-discriminatory classifiers

Motivating Example
name   sex  age  program  acceptance
Ada    F    18   cancer   +
Bob    M    25   heart    -
Cathy  F    20   cancer   +
Ed     M    60   cancer   -
Fred   M    24   flu      -
…
Suppose 2000 applicants, 1000 M and 1000 F
Acceptance ratio: 36% M vs. 24% F
Do we have discrimination here?

Discrimination Discovery
- Assume a causal Bayesian network that faithfully represents the data.
- Protected attribute C with values c+, c-; decision attribute E with values e+, e-
- ∆P = P(e+|c+) − P(e+|c−)
- There is a discriminatory effect if ∆P > τ, where τ is a threshold for discrimination depending on law (e.g., 5%).
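
The ∆P test can be applied to the numbers of the motivating example (1000 male and 1000 female applicants, 36% versus 24% accepted). The sketch below assumes that c+ denotes male applicants and c- female applicants, and uses the 5% threshold mentioned on the slide; the function names are hypothetical.

```python
def delta_p(accepted_c_plus, total_c_plus, accepted_c_minus, total_c_minus):
    """Risk difference: P(e+ | c+) - P(e+ | c-)."""
    return accepted_c_plus / total_c_plus - accepted_c_minus / total_c_minus

def is_discriminatory(dp, tau=0.05):
    """Flag a discriminatory effect when the risk difference exceeds the threshold tau."""
    return dp > tau

# Motivating example: 36% of 1000 men and 24% of 1000 women accepted.
dp = delta_p(360, 1000, 240, 1000)
flagged = is_discriminatory(dp)
print(round(dp, 3), flagged)
```

This raw difference ignores other attributes such as the program applied to, which is why the slide assumes a causal Bayesian network before drawing conclusions.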

Motivating Examples
- Case I: ∆P = 0.1
- Case II: ∆P = -0.01

Motivating Examples
- Case II
- Case III
∆P = 0.104

Discrimination Analysis
Discrimination is treatment or consideration of, or making a distinction in favor of or against, a person or thing based on the group, class, or category to which that person or thing is perceived to belong, rather than on individual merit. (Wikipedia)
Tweets discrimination analysis aims to detect whether a tweet contains discrimination against gender, race, age, etc.

A Typical Deep Learning Pipeline for Text Classification
Text → Word Representation → Deep Learning Model → Text Representation → Softmax Classifier
Deep Learning Model: Multilayer Perceptron, Recursive Neural Network, Recurrent Neural Network, Convolutional Neural Network
Levels: word → semantic composition → text

Tweet → Word Embeddings → LSTM-RNN → Mean Pooling → Tweet Representation → Logistic Regression
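
The last two stages of this pipeline can be sketched in plain Python. The hidden states below stand in for the per-word outputs of the LSTM-RNN, and the weights and bias are made-up numbers rather than a trained model; only the mean pooling and logistic regression steps are shown.

```python
import math

def mean_pool(hidden_states):
    """Average the per-word hidden vectors into one tweet representation."""
    dim = len(hidden_states[0])
    count = len(hidden_states)
    return [sum(h[i] for h in hidden_states) / count for i in range(dim)]

def logistic_regression(features, weights, bias):
    """Probability of the positive class: sigmoid of a linear score."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical LSTM outputs for a three-word tweet (one 3-d vector per word).
hidden_states = [
    [0.2, -0.1, 0.5],
    [0.4, 0.3, -0.1],
    [0.0, 0.1, 0.2],
]
# Hypothetical classifier parameters.
weights = [1.0, -2.0, 0.5]
bias = -0.1

tweet_representation = mean_pool(hidden_states)
probability = logistic_regression(tweet_representation, weights, bias)
print(tweet_representation, probability)
```

Mean pooling gives a fixed-length vector regardless of tweet length, so a single logistic regression layer can classify tweets of any size.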

Summary
1. Preserving Privacy Values
2. Educating Robustly and Responsibly
3. Big Data and Discrimination
4. Law Enforcement & Security
5. Data as a Public Resource

Acknowledgement
Collaborators:
- UNCC: Aidong Lu, Xinghua Shi, Yong Ge
- Oregon: Jun Li, Dejing Dou
- PeaceHealth: Brigitte Piniewski
- UIUC: Tao Xie
DPL members:
- UNCC: PhD graduates: Songtao Guo, Ling Guo, Kai Pan, Leting Wu, Xiaowei Ying. PhD students: Yue Wang, Yuemeng Li, Zhilin Luo (visiting)
- UofA: Lu Zhang (postdoc), Yongkai Wu, Cheng Si, Miao Xie, Shuhan Yuan
Funding support:

Genome Wide Association Study