Detecting Data Leakage Panagiotis Papadimitriou Hector Garcia-Molina

Presentation transcript:


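Leakage Problem
Stanford Infolab
[Figure: the profiles of Kathryn, Jeremy, Sarah and Mark (e.g. "Name: Mark, Sex: Male, …" and "Name: Sarah, Sex: Female, …") are shared with applications App. U_1 and App. U_2; profiles can also become known through other sources, e.g. Sarah’s network.]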

Outline
Problem Description
Guilt Models
– Pr{U_1 leaked data} = 0.7
– Pr{U_2 leaked data} = 0.2
Distribution Strategies

Outline: Problem Description, Guilt Models, Distribution Strategies

Problem Entities
Entity: Distributor (Facebook). Dataset: T, the set of all Facebook profiles.
Entity: Agents (Facebook apps U_1, …, U_n). Dataset: R_1, …, R_n, where R_i is the set of profiles of the people who have added application U_i.
Entity: Leaker. Dataset: S, the set of leaked profiles.

Agents’ Data Requests
Sample
– 100 profiles of Stanford people
Explicit
– All people who added the application (the example used so far)
– All Stanford profiles

Outline: Problem Description, Guilt Models, Distribution Strategies

Guilt Models (1/3)
[Figure: a leaked profile may come from other sources, e.g. Sarah’s network, with probability p.]
p: posterior probability that a leaked profile comes from other sources
Guilty agent: an agent who leaks at least one profile
Pr{G_i | S}: probability that agent U_i is guilty, given the leaked set of profiles S

Guilt Models (2/3)
Two alternative assumptions:
– Agents leak each of their data items independently, or
– Agents leak all their data items or nothing
[Figure: leak scenarios labeled with the probabilities (1-p)^2, (1-p)p, p(1-p) and p^2.]
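The slides do not spell out how Pr{G_i | S} is computed, so the following is only a minimal sketch of the independent-leak case. It assumes that each leaked profile came from other sources with probability p and otherwise from exactly one of the agents holding it, each equally likely; the equal-likelihood assumption, the function name guilt_probability and the example allocation are illustrative and not taken from the slides.

```python
def guilt_probability(agent, allocations, leaked, p):
    """Estimate Pr{G_i | S} when agents leak items independently (illustrative).

    allocations maps each agent name to the set of profiles it received,
    leaked is the set S, and p is the probability that a leaked profile
    came from other sources.
    """
    prob_innocent = 1.0
    for item in leaked & allocations[agent]:
        # Number of agents that were given this profile.
        holders = sum(1 for items in allocations.values() if item in items)
        # Assumption: if an agent leaked the item, every holder is equally likely.
        prob_innocent *= 1.0 - (1.0 - p) / holders
    return 1.0 - prob_innocent


# Hypothetical allocation: Jeremy's profile is shared by both agents.
allocations = {
    "U1": {"Kathryn", "Jeremy"},
    "U2": {"Jeremy", "Sarah"},
}
leaked = {"Kathryn", "Jeremy"}

for agent in sorted(allocations):
    print(agent, round(guilt_probability(agent, allocations, leaked, p=0.2), 3))
```

Under these assumptions the agent holding both leaked profiles gets the higher guilt estimate, because one of those profiles was given to nobody else.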

Guilt Models (3/3)
[Figure: two panels, "Independently" and "NOT Independently", showing Pr{G_1} and Pr{G_2}.]

Outline: Problem Description, Guilt Models, Distribution Strategies

The Distributor’s Objective (1/2)
[Figure: agents U_1, …, U_4 send requests and receive the sets R_1, …, R_4; for the leaked set S the figure shows Pr{G_1 | S} >> Pr{G_2 | S} and Pr{G_1 | S} >> Pr{G_4 | S}.]

The Distributor’s Objective (2/2)
To achieve his objective, the distributor has to distribute sets R_1, …, R_n that minimize [formula not preserved in the transcript]
Intuition: minimizing data sharing among agents makes the leaked data reveal the guilty agents
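The quantity being minimized is not preserved in this transcript. As a hedged illustration of the stated intuition only, the sketch below scores an allocation by the total pairwise overlap, the sum of |R_i ∩ R_j| over all pairs of agents. That measure and the name total_overlap are assumptions chosen for this example, not the speakers' formula.

```python
from itertools import combinations


def total_overlap(allocations):
    """Sum of |R_i & R_j| over all unordered pairs of agents (assumed measure)."""
    return sum(len(a & b) for a, b in combinations(allocations.values(), 2))


# Two hypothetical allocations of the four profiles to two agents.
disjoint = {"U1": {"Kathryn", "Jeremy"}, "U2": {"Sarah", "Mark"}}
shared = {"U1": {"Kathryn", "Jeremy"}, "U2": {"Jeremy", "Sarah"}}

print("disjoint:", total_overlap(disjoint))
print("shared:", total_overlap(shared))
```

With disjoint sets, any leaked profile points to a single agent, which is why a lower score makes the leaked set more informative.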

Distribution Strategies – Sample (1/4)
Set T has four profiles:
– Kathryn, Jeremy, Sarah and Mark
There are 4 agents:
– U_1, U_2, U_3 and U_4
Each agent requests a sample of any 2 profiles of T for a market survey

Distribution Strategies – Sample (2/4)
[Figure: two allocations of profiles to agents U_1, …, U_4, labeled "Poor" and "Minimize".]

Distribution Strategies – Sample (3/4)
Optimal Distribution
Avoid full overlaps and minimize [formula not preserved in the transcript]
[Figure: allocation of profiles to agents U_1, …, U_4.]
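For the four-profile, four-agent example, the advice to avoid full overlaps can be checked by brute force. The sketch below is an illustration under an assumed overlap measure (total pairwise overlap, which may differ from the formula on the slide): it enumerates every way to give each agent 2 profiles, keeps the allocations with the smallest score, and then drops those in which two agents receive identical sets. All function and variable names are hypothetical.

```python
from itertools import combinations, product

PROFILES = ["Kathryn", "Jeremy", "Sarah", "Mark"]
AGENTS = ["U1", "U2", "U3", "U4"]


def total_overlap(sets):
    """Sum of |R_i & R_j| over all unordered pairs of agents (assumed measure)."""
    return sum(len(a & b) for a, b in combinations(sets, 2))


def has_full_overlap(sets):
    """True if two agents were given exactly the same profiles."""
    return any(a == b for a, b in combinations(sets, 2))


# Every possible 2-profile sample, and every way to hand one to each agent.
samples = [frozenset(pair) for pair in combinations(PROFILES, 2)]
candidates = list(product(samples, repeat=len(AGENTS)))

best = min(total_overlap(c) for c in candidates)
minimal = [c for c in candidates if total_overlap(c) == best]
good = [c for c in minimal if not has_full_overlap(c)]

print("minimum total overlap:", best)
print("minimal allocations:", len(minimal), "without full overlaps:", len(good))
print(dict(zip(AGENTS, (sorted(s) for s in good[0]))))
```

Some allocations that reach the minimum score still give two agents identical samples, so a leak of those two profiles could not separate them. Minimizing the score alone is therefore not enough in this example, and the extra "no full overlaps" filter is needed.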

Distribution Strategies – Sample (4/4)

Distribution Strategies
Sample Data Requests
– The distributor is free to choose which data items to give to each agent
– General idea: give agents sets of data that are as disjoint as possible
– Problem: there are cases where the distributed data must overlap, e.g., |R_1| + … + |R_n| > |T|
Explicit Data Requests
– The distributor must provide agents with the data they request
– General idea: add fake data to the distributed data to minimize the overlap of the distributed data
– Problem: agents can collude and identify fake data
– NOT COVERED in this talk
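The condition under which sample requests must overlap can be written down directly. The sketch below checks that condition and also computes a pigeonhole lower bound on the total pairwise overlap, assuming no single request is larger than |T|. The bound and both function names are illustrative additions, not part of the talk.

```python
from math import comb


def overlap_is_unavoidable(request_sizes, num_items):
    """True when the requests cannot all be served from disjoint subsets of T."""
    return sum(request_sizes) > num_items


def min_total_overlap(request_sizes, num_items):
    """Lower bound on the summed pairwise overlap (even spread of copies)."""
    copies, extra = divmod(sum(request_sizes), num_items)
    # `extra` items are handed out copies + 1 times, the rest copies times;
    # an item held by c agents contributes comb(c, 2) overlapping pairs.
    return extra * comb(copies + 1, 2) + (num_items - extra) * comb(copies, 2)


# The example from the slides: 4 agents each request 2 of 4 profiles.
print(overlap_is_unavoidable([2, 2, 2, 2], 4), min_total_overlap([2, 2, 2, 2], 4))
# Two agents requesting 2 profiles each can be served with disjoint sets.
print(overlap_is_unavoidable([2, 2], 4), min_total_overlap([2, 2], 4))
```

The bound follows from spreading the distributed copies as evenly as possible over the items in T, since an item held by many agents contributes disproportionately many overlapping pairs.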

Conclusions
– Data leakage modeled as a maximum likelihood problem
– Data distribution strategies that help identify the guilty agents

Thank You!