Social Computing in Big Data Era – Privacy Preservation and Fairness Awareness
Xintao Wu, Nov 19, 2015
Drivers of Data Computing
6A's: Anytime, Anywhere Access to Anything by Anyone Authorized
4V's: Volume, Velocity, Variety, Veracity
Cross-cutting concerns: Reliability, Security, Privacy, Usability
4V's
AVC Denial Log Analysis
Volume and Velocity: 1 million log files per day, each with thousands of entries
S3, Hive, and EMR on AWS
Social Media Customer Analytics
Variety and Veracity: 10 GB of tweets per day
Data sources: network topology (friendship, followship, interaction), unstructured text (e.g., blog, tweet), transaction database, product and review data, retweet sequences
Structured profile:
name, sex, age, disease, salary
Ada, F, 18, cancer, 25k
Bob, M, 25, heart, 110k
…
Released profile:
id, sex, age, address, income
5, F, Y, NC, 25k
3, M, Y, SC, 110k
Analysis tasks: entity resolution, patterns, temporal/spatial analysis, scalability, visualization, sentiment, privacy
Funding: Belk, Lowe's, and UNCC Chancellor's special fund
A Single View to the Customer
Linking a customer's social media, gaming and entertainment, banking and finance, and purchase history into a single view of our known history.
Outline
Introduction
Privacy Preserving Social Network Analysis
  Input perturbation
  Output perturbation
Anti-discrimination Learning
Privacy Breach Cases
Nydia Velázquez (1994): a medical record of her suicide attempt was disclosed
AOL search log (2006): an anonymized release of 650K users' search histories lasted less than 24 hours before being pulled
Netflix contest (2009): the $1M contest was cancelled due to a privacy lawsuit
23andMe (2013): the FDA ordered its genetic health testing discontinued over genetic privacy concerns
Acxiom
Privacy: in 2003, EPIC alleged that Acxiom provided consumer information to the US Army "to determine how information from public and private records might be analyzed to help defend military bases from attack." In 2013, Acxiom was among nine companies the FTC investigated to see how they collect and use consumer data.
Security: in 2003, more than 1.6 billion customer records were stolen during the transmission of information to and from Acxiom's clients.
Privacy Regulation (Forrester heat map)
Countries are classified from most restricted, restricted, some restrictions, minimal restrictions, and effectively no restrictions, down to no legislation or no information.
Privacy Protection Laws
USA: HIPAA for health care; Gramm-Leach-Bliley Act of 1999 for financial institutions; COPPA for children's online privacy; state regulations, e.g., California State Bill 1386
Canada: PIPEDA 2000 (Personal Information Protection and Electronic Documents Act)
European Union: Directive 95/46/EC provides guidelines for member-state legislation and forbids sharing data with states that do not protect privacy
Contractual obligations: individuals should have notice about how their data is used and have opt-out choices
Privacy Preserving Data Mining
Even with ssn and name removed, quasi-identifiers remain:
zip, race, …, age, sex, income, …, disease
28223, Asian, …, 20, M, 85k, …, cancer
28223, Asian, …, 30, F, 70k, …, flu
28262, Black, …, 20, M, 120k, …, heart
28261, White, …, 26, M, 23k, …, cancer
…
69% of individuals are unique on zip and birth date; 87% with zip, birth date, and gender.
Approaches: generalization (k-anonymity, l-diversity, t-closeness) and randomization.
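The randomization approach mentioned above can be illustrated with a minimal randomized-response sketch. The flip probability p = 0.7 and the 30% disease rate below are illustrative assumptions, not from the slides:

```python
import random

def randomized_response(truth: bool, p: float = 0.7) -> bool:
    """With probability p report the true value, otherwise flip it."""
    return truth if random.random() < p else not truth

def estimate_true_rate(reports, p: float = 0.7) -> float:
    """Unbiased estimate: observed = p*pi + (1-p)*(1-pi), solve for pi."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p)) / (2 * p - 1)

random.seed(0)
truths = [random.random() < 0.3 for _ in range(100_000)]  # 30% have the disease
reports = [randomized_response(t) for t in truths]
print(round(estimate_true_rate(reports), 2))  # close to 0.3
```

No individual report reveals the respondent's true value, yet the aggregate rate is still recoverable.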
Social Network Data
The data owner releases an anonymized table to the data miner.
Original: name, sex, age, disease, salary
Ada, F, 18, cancer, 25k
Bob, M, 25, heart, 110k
Cathy, F, 20, cancer, 70k
Dell, M, 65, flu, 65k
Ed, M, 60, cancer, 300k
Fred, M, 24, flu, 20k
George, M, 22, cancer, 45k
Harry, M, 40, flu, 95k
Irene, F, 45, heart, 70k
Released: id, sex, age (Y/M/O), disease, salary
5, F, Y, cancer, 25k
3, M, Y, heart, 110k
6, F, Y, cancer, 70k
1, M, O, flu, 65k
7, M, O, cancer, 300k
2, M, Y, flu, 20k
9, M, Y, cancer, 45k
4, M, M, flu, 95k
8, F, M, heart, 70k
Threat of Re-identification
An attacker targets the released table (ids and generalized ages in place of names).
Privacy breaches: identity disclosure, link disclosure, attribute disclosure.
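One way to quantify re-identification risk in a released table is the smallest equivalence-class size over the quasi-identifiers: a value of 1 means some individual is unique on those attributes and thus re-identifiable. A minimal sketch using the sex and generalized-age columns of the released table (the function name is illustrative):

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """k = size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

# Released table from the slide: sex, generalized age (Y/M/O), disease
released = [
    ("F", "Y", "cancer"), ("M", "Y", "heart"), ("F", "Y", "cancer"),
    ("M", "O", "flu"),    ("M", "O", "cancer"), ("M", "Y", "flu"),
    ("M", "Y", "cancer"), ("M", "M", "flu"),    ("F", "M", "heart"),
]
records = [{"sex": s, "age": a, "disease": d} for s, a, d in released]
print(k_anonymity(records, ["sex", "age"]))  # 1: the only (M, M) row is unique
```

Here k = 1, so the generalization is not sufficient: an attacker who knows a target is a middle-aged male pins down exactly one row.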
Privacy Preservation in Social Network Analysis
Input perturbation: k-anonymity, generalization, randomization
Our Work
Feature-preserving randomization:
  Spectrum-preserving randomization (SDM08)
  Markov chain based feature-preserving randomization (SDM09)
  Reconstruction from a randomized graph (SDM10)
Link privacy (from the attacker's perspective):
  Exploiting node similarity features (PAKDD09, Best Student Paper Runner-up Award)
  Exploiting the graph space via Markov chains (SDM09)
Spectrum Preserving Randomization [SDM08]
Spectral switch: edge-switching operations chosen to deliberately increase or decrease the graph's eigenvalues (the switch formulas appear as figures in the original slides).
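The SDM08 spectral-switch formulas are only images in the original slides, so they are not reproduced here. The degree-preserving double edge swap that such edge-switching schemes build on can be sketched as follows; this is a generic randomization step, not the spectrum-targeted variant:

```python
import random
from collections import Counter

def double_edge_swap(edges, n_swaps=100, seed=0):
    """Randomize a simple undirected graph while preserving every node's degree:
    pick edges (a, b) and (c, d), rewire them to (a, d) and (c, b) whenever
    that creates no self-loop and no duplicate edge."""
    rng = random.Random(seed)
    edge_set = {frozenset(e) for e in edges}
    for _ in range(n_swaps):
        (a, b), (c, d) = rng.sample([tuple(sorted(e)) for e in edge_set], 2)
        if len({a, b, c, d}) < 4:
            continue  # swap would create a self-loop
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in edge_set or new2 in edge_set:
            continue  # swap would create a duplicate edge
        edge_set -= {frozenset((a, b)), frozenset((c, d))}
        edge_set |= {new1, new2}
    return edge_set

def degrees(edge_set):
    return Counter(v for e in edge_set for v in e)

g = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
print(degrees(double_edge_swap(g)) == degrees({frozenset(e) for e in g}))  # True
```

Because each swap preserves every endpoint's degree, the degree sequence of the randomized graph always matches the original.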
Reconstruction from Randomized Graph [SDM10]
A graph close to the original can be reconstructed from the randomized release without incurring much privacy loss.
Exploiting Graph Space [SDM09]
(Comparison with the original graph shown as figures in the original slides.)
PSNet (NSF-0831204)
Output Perturbation
The data owner keeps the raw table (name, sex, age, disease, salary) and answers each query f from the data miner with the query result plus noise. The noisy answers cannot be used to derive whether any individual is included in the database.
Differential Guarantee [Dwork, TCC06]
Two neighboring databases differ in one record (Ed's row is removed, achieving opt-out). For f = count(#cancer), the mechanism K releases f(x) + noise = 3 + noise on the full database and f(x') + noise = 2 + noise on the neighbor, and the two output distributions are nearly indistinguishable.
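The two databases on this slide differ in a single record, so the true counts differ by exactly one; that gap of one is what the noise must hide. A sketch using the names from the slide:

```python
def count_cancer(db):
    """The query f from the slide: count(#cancer)."""
    return sum(1 for _, disease in db if disease == "cancer")

x = [("Ada", "cancer"), ("Bob", "heart"), ("Cathy", "cancer"),
     ("Dell", "flu"), ("Ed", "cancer"), ("Fred", "flu")]
x_prime = [r for r in x if r[0] != "Ed"]  # neighboring database: Ed opts out

print(count_cancer(x), count_cancer(x_prime))  # 3 2
```

Whichever record opts out, a count query changes by at most 1, which is exactly the global sensitivity used on the next slides.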
Differential Privacy
For all neighboring databases x, x' and all output sets S: Pr[K(x) ∈ S] ≤ e^ε · Pr[K(x') ∈ S].
ε is the privacy parameter: smaller ε = stronger privacy.
Calibrating Noise
Noise is drawn from the Laplace distribution Lap(Δf/ε).
Global sensitivity: Δf = max over all neighboring databases x, x' of ||f(x) − f(x')||1.
Local sensitivity: the same maximum taken at the actual database x over its neighbors x'.
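A minimal Laplace-mechanism sketch using only the standard library; it relies on the fact that a Laplace variate is the difference of two exponential variates (function names are illustrative):

```python
import random

def laplace_noise(scale, rng):
    """Sample Lap(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def laplace_mechanism(true_answer, sensitivity, epsilon, seed=None):
    """Release f(x) + Lap(sensitivity / epsilon), which gives
    epsilon-differential privacy for a function with that global sensitivity."""
    rng = random.Random(seed)
    return true_answer + laplace_noise(sensitivity / epsilon, rng)

# Noisy count query (sensitivity 1); smaller epsilon -> larger noise scale
print(laplace_mechanism(3, sensitivity=1, epsilon=0.5, seed=0))
```

The noise is unbiased, so averaging many independent releases would recover the true count; a single release hides any one individual's presence.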
Sensitivity
Function f → sensitivity Δf:
  count(#cancer) → 1
  sum(salary) → u (domain upper bound)
  avg(salary) → u/n
For vector-valued outputs, sensitivity is measured in L1 distance. Data mining tasks can be decomposed into a sequence of such simple functions.
Challenge in OSN
Degree sequence: adding or removing one edge changes the sequence from, e.g., [1,1,3,3,3,3,2] to [1,1,3,3,2,2,2], so Δf = 2 and noise from Lap(2/ε) suffices.
Number of triangles: removing one edge can destroy up to n − 2 triangles, so Δf = n − 2 and huge noise is needed. High sensitivity!
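The two sensitivities on this slide can be checked on a small graph: removing any one edge changes the degree sequence by exactly 2 in L1 distance, while on a complete graph it removes exactly n − 2 triangles. A sketch with an illustrative 6-node graph:

```python
from itertools import combinations

def degree_seq(n, edges):
    deg = [0] * n
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg

def triangles(n, edges):
    """Count triangles by checking all 3-node subsets."""
    es = {frozenset(e) for e in edges}
    return sum(1 for t in combinations(range(n), 3)
               if all(frozenset(p) in es for p in combinations(t, 2)))

n = 6
full = list(combinations(range(n), 2))            # complete graph K6
missing = [e for e in full if e != (0, 1)]        # neighbor: one edge removed

d1, d2 = degree_seq(n, full), degree_seq(n, missing)
print(sum(abs(a - b) for a, b in zip(d1, d2)))    # 2: degree-sequence L1 change
print(triangles(n, full) - triangles(n, missing)) # 4: n - 2 triangles lost
```

The degree sequence's sensitivity stays constant as the graph grows, but the triangle count's grows linearly in n, which is why it needs the advanced mechanisms on the next slide.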
Advanced Mechanisms
Possible theoretical approaches: smooth sensitivity, exponential mechanism, functional mechanism, sampling.
Our Work
DP-preserving clustering coefficient (ASONAM12)
DP-preserving spectral graph analysis (PAKDD13)
Linear refinement of DP-preserving query answering (PAKDD13, Best Application Paper)
DP-preserving graph generation based on degree correlation (TDP13)
Regression model fitting under differential privacy and model inversion attack (IJCAI15)
DP-preservation for deep auto-encoders (AAAI16)
SMASH (NIH R01GM103309)
Genetic Privacy (NSF 1502273 and 1523115)
BIBM13 Best Paper Award
Outline
Introduction
Privacy Preserving Social Network Analysis
  Input perturbation
  Output perturbation
Anti-discrimination Learning
What is discrimination?
Discrimination refers to unjustified distinctions among individuals based on their membership in a certain group. Federal laws and regulations disallow discrimination on several grounds: gender, age, marital status, sexual orientation, race, religion or belief, disability or illness, etc. These attributes are referred to as the protected attributes, and the groups they define as the protected groups.
Predictive Learning
Finding evidence of discrimination
Building non-discriminatory classifiers
Motivating Example
name, sex, age, program, acceptance
Ada, F, 18, cancer, +
Bob, M, 25, heart, −
Cathy, F, 20, cancer, +
Ed, M, 60, cancer, −
Fred, M, 24, flu, −
…
Suppose 2,000 applicants: 1,000 M and 1,000 F. Acceptance ratio: 36% for M vs. 24% for F. Do we have discrimination here?
Discrimination Discovery
Assume a causal Bayesian network that faithfully represents the data, with protected attribute C (values c+, c−) and decision attribute E (values e+, e−).
∆P = P(e+|c+) − P(e+|c−)
There is a discriminatory effect if ∆P > τ, where τ is a threshold for discrimination depending on law (e.g., 5%).
Motivating Examples
Case I: ∆P = 0.1. Case II: ∆P = −0.01. (Causal diagrams shown as figures in the original slides.)
Motivating Examples
Case II: ∆P = −0.01. Case III: ∆P = 0.104, with subgroup effects ∆P++ and ∆P−+. (Causal diagrams shown as figures in the original slides.)
Discrimination Analysis
Discrimination is "treatment or consideration of, or making a distinction in favor of or against, a person or thing based on the group, class, or category to which that person or thing is perceived to belong, rather than on individual merit" (Wikipedia).
Tweet discrimination analysis aims to detect whether a tweet contains discrimination against gender, race, age, etc.
A Typical Deep Learning Pipeline for Text Classification
Text → word representation → deep learning model (multilayer perceptron, recursive neural network, recurrent neural network, or convolutional neural network) → text representation → softmax classifier. The model composes word semantics into a text representation.
Word Embeddings
Tweet → word embeddings → LSTM-RNN → mean pooling → tweet representation → logistic regression
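The pipeline on this slide (embeddings → LSTM-RNN → mean pooling → logistic regression) can be sketched with NumPy. All dimensions, the tiny vocabulary, and the random, untrained weights below are illustrative stand-ins; the deck specifies none of them:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary with 8-dimensional word embeddings
vocab = {"this": 0, "tweet": 1, "is": 2, "hateful": 3, "kind": 4}
E = rng.normal(size=(len(vocab), 8))

H = 8  # hidden size
W = rng.normal(size=(4 * H, 8))
U = rng.normal(size=(4 * H, H))
b = np.zeros(4 * H)

def lstm_step(h, c, x):
    """One step of a standard LSTM cell (input, forget, output, candidate gates)."""
    z = W @ x + U @ h + b
    i, f, o, g = np.split(z, 4)
    i, f, o = (1 / (1 + np.exp(-v)) for v in (i, f, o))
    c = f * c + i * np.tanh(g)
    return o * np.tanh(c), c

def tweet_representation(tokens):
    """Run the LSTM over the token embeddings, then mean-pool the hidden states."""
    h, c = np.zeros(H), np.zeros(H)
    states = []
    for tok in tokens:
        h, c = lstm_step(h, c, E[vocab[tok]])
        states.append(h)
    return np.mean(states, axis=0)

def classify(tokens, w, b0):
    """Logistic regression on the pooled tweet representation."""
    z = w @ tweet_representation(tokens) + b0
    return 1 / (1 + np.exp(-z))  # P(tweet is discriminatory)

w = rng.normal(size=H)
p = classify(["this", "tweet", "is", "hateful"], w, 0.0)
print(0.0 < p < 1.0)  # a valid probability (weights are untrained)
```

In practice the embeddings, LSTM weights, and logistic layer would all be trained end-to-end on labeled tweets; the sketch only shows how the stages connect.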
Summary
1. Preserving Privacy Values
2. Educating Robustly and Responsibly
3. Big Data and Discrimination
4. Law Enforcement and Security
5. Data as a Public Resource
Acknowledgement
Collaborators:
  UNCC: Aidong Lu, Xinghua Shi, Yong Ge
  Oregon: Jun Li, Dejing Dou
  PeaceHealth: Brigitte Piniewski
  UIUC: Tao Xie
DPL members:
  UNCC PhD graduates: Songtao Guo, Ling Guo, Kai Pan, Leting Wu, Xiaowei Ying
  UNCC PhD students: Yue Wang, Yuemeng Li, Zhilin Luo (visiting)
  UofA: Lu Zhang (postdoc), Yongkai Wu, Cheng Si, Miao Xie, Shuhan Yuan
Funding support:
Genome-Wide Association Study