Turning Privacy Leaks into Floods: Surreptitious Discovery of Social Network Friendships Michael T. Goodrich Univ. of California, Irvine joint w/ Arthur.


Turning Privacy Leaks into Floods: Surreptitious Discovery of Social Network Friendships Michael T. Goodrich Univ. of California, Irvine joint w/ Arthur U. Asuncion

Problem Definition Discover the friendships

Leveraging Information Leaks Leak: A member's friendship list can be viewed by friends-of-friends. This allows: –Given two people, X and Y, we can tell whether X and Y have a friend in common. Leverage: We use this to discover the friends lists of members of the network.
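
The leak described on this slide can be pictured as a one-bit oracle. The following is a minimal sketch only: the `friends` dictionary and the function name `have_common_friend` are hypothetical, introduced for illustration, and do not correspond to any real social network API.

```python
def have_common_friend(friends, x, y):
    """Return True if x and y share at least one friend.

    `friends` maps each member to the set of that member's friends.
    This models only the one-bit answer the leak exposes.
    """
    return bool(friends.get(x, set()) & friends.get(y, set()))


# Hypothetical toy network, for illustration only.
friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice"},
    "dave": {"bob"},
}

print(have_common_friend(friends, "alice", "dave"))  # True (both know bob)
print(have_common_friend(friends, "carol", "dave"))  # False
```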

Abstracting the Problem Viewed abstractly, we are trying to learn binary attribute vectors.

Group Testing Input: n items, numbered 0,1, …, n-1, at most d of which are defective. Output: the indices of the defective items. Items can be grouped into subsets, each of which can be tested to see whether it contains a defective item or not. Goal: minimize the total number of tests. Original problem: Testing blood samples.
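
A pooled test in this setting returns a single bit for a whole subset. The sketch below simulates that bit; the name `pooled_test` and the toy values are assumptions made for illustration, and the `defective` set stands in for the hidden truth that a real tester would not see.

```python
def pooled_test(group, defective):
    """Return True if the pooled group contains at least one defective item."""
    return any(item in defective for item in group)


defective = {5}  # hidden from the tester; used here only to simulate outcomes

# One negative test clears items 0..3 all at once.
print(pooled_test(range(0, 4), defective))  # False
print(pooled_test(range(4, 8), defective))  # True
```

A single negative result rules out every item in the pool, which is why far fewer than n tests can suffice when d is small.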

Testing Schemes Non-adaptive: All tests must be done in parallel. Adaptive: Tests can be done sequentially. Adaptive is easier, but our framework requires a non-adaptive approach.
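
To illustrate why the adaptive setting is easier, here is a sketch of adaptive binary splitting under the simplifying assumption that exactly one item is defective. This is a textbook illustration, not the scheme used in the talk; the function name and the `is_positive` callback are hypothetical.

```python
def find_single_defective(n, is_positive):
    """Adaptively locate the one defective item among 0..n-1.

    Assumes exactly one defective item. `is_positive(group)` runs one
    pooled test. Each test depends on the previous answer, which is what
    makes the scheme adaptive.
    """
    lo, hi = 0, n
    tests = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tests += 1
        if is_positive(range(lo, mid)):
            hi = mid  # defective item is in the lower half
        else:
            lo = mid  # defective item is in the upper half
    return lo, tests


hidden = 11  # simulated ground truth
print(find_single_defective(16, lambda g: hidden in g))  # (11, 4)
```

A non-adaptive scheme has to fix all of its pools before seeing any result, so it cannot narrow the search this way.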

Facebook Application Each member has a “vector” of friendships. For any member M, the system returns a bit for whether M has a friend in common with the attacker, even if M restricts this information to friends-of-friends. We can use a non-adaptive scheme to learn friendship relationships in any sub-community in Facebook.

DNA Application DNA sequences are stored in a database, D. For any sequence Q, the database returns a score for how close Q is to each sequence in D. We form a binary vector w.r.t. places where mutations happen relative to a reference string R. We can use a non-adaptive scheme to learn DNA strings in D.
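
One way to picture the binary vector relative to R is sketched below. It assumes the sequence and the reference are already aligned and of equal length, which the slide does not state; the function name is hypothetical.

```python
def mutation_vector(seq, ref):
    """Return a 0/1 list marking positions where seq differs from ref.

    Assumption: seq and ref are aligned and have the same length.
    """
    if len(seq) != len(ref):
        raise ValueError("sequences must have equal length")
    return [int(a != b) for a, b in zip(seq, ref)]


print(mutation_vector("ACGTAC", "ACCTAA"))  # [0, 0, 1, 0, 0, 1]
```

When most sequences are close to R, these vectors are sparse, which matches the "at most d defective items" setting of group testing.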

Netflix Application Movie ratings vectors are stored in a database, D. For any vector V, the database returns a score for how close V is to each vector in the database. We can form a binary attribute vector for movies. We can use a non-adaptive scheme to learn ratings vectors in D.

Matrix View of Testing A non-adaptive testing regimen can be viewed as a t x n binary matrix M: –M[i,j] = 1 if and only if test i includes item j. M is d-disjunct if the Boolean sum of any d columns does not contain any other column. –An item is defective iff all its tests are positive. M is d-separable if the Boolean sums of each set of at most d columns are distinct (harder analysis algorithm).
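
The decoding rule for a d-disjunct matrix ("an item is defective iff all its tests are positive") can be sketched as follows. The example matrix is a small 1-disjunct matrix chosen for illustration (its columns are the distinct 2-element subsets of 4 rows); it is not a matrix from the talk, and the function names are hypothetical.

```python
from itertools import combinations


def outcomes(M, defective):
    """Simulate test results: test i is positive if it includes a defective item."""
    return [any(row[j] for j in defective) for row in M]


def decode(M, results):
    """Declare item j defective iff every test that includes j is positive."""
    n = len(M[0])
    return [
        j for j in range(n)
        if all(results[i] for i in range(len(M)) if M[i][j])
    ]


# Columns are the 6 two-element subsets of rows {0,1,2,3}; no column
# contains another, so the matrix is 1-disjunct.
t = 4
cols = list(combinations(range(t), 2))
M = [[1 if i in col else 0 for col in cols] for i in range(t)]

results = outcomes(M, {4})
print(decode(M, results))  # [4]
```

Note that a column with no 1s would always be reported as defective by this rule, so every item should appear in at least one test.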

Randomized Approach Use a randomized approach motivated by Bloom filtering. Construct a matrix M, but relax requirements. Given a set D of d columns in M and a column j, say j is distinguishable from D if there is a row i such that M[i,j]=1 but M[i,j’]=0 for each j’ in D. M is D-distinguishable if, for a particular collection D of subsets, the matrix M will find them distinguishable.
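
The per-column condition on this slide translates directly into a check over the rows of M. The sketch below is only an illustration of that definition; the function name and the toy matrix are assumptions.

```python
def is_distinguishable(M, j, D):
    """Return True if some row i has M[i][j] == 1 and M[i][k] == 0 for all k in D."""
    return any(
        row[j] == 1 and all(row[k] == 0 for k in D)
        for row in M
    )


# Toy 3 x 3 matrix, for illustration only.
M = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
]

print(is_distinguishable(M, 0, {1}))     # True  (row 0 isolates column 0)
print(is_distinguishable(M, 1, {0, 2}))  # False (every row hitting column 1 also hits 0 or 2)
```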

Constructing the Matrix Given t (set in the analysis), let M be a 2t x n matrix defined randomly: –For each column j, choose t/d rows of M at random and set these entries to 1. –That is, we “inject” j into those t/d tests.
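
The construction above can be sketched as follows. Two details are assumptions not stated on the slide: t/d is rounded down with integer division, and the rows for each column are sampled without replacement. The function name and parameters are hypothetical.

```python
import random


def build_matrix(n, t, d, seed=None):
    """Build a random 2t x n binary matrix with t // d ones in each column.

    Assumptions: t // d rounds down, and rows are chosen without replacement.
    """
    rng = random.Random(seed)
    rows = 2 * t
    ones_per_column = t // d
    M = [[0] * n for _ in range(rows)]
    for j in range(n):
        # "Inject" item j into ones_per_column randomly chosen tests.
        for i in rng.sample(range(rows), ones_per_column):
            M[i][j] = 1
    return M


M = build_matrix(n=10, t=12, d=3, seed=1)
print(len(M), len(M[0]))  # 24 10
```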

Technique for Social Networks Insert a small set of fictional network members. Form connections with random network members. Test the common-friends condition for the fictional members.
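
The three steps on this slide can be simulated end to end. This is a sketch under stated assumptions, not the authors' implementation: each fictional account plays the role of one test row, members are assigned to accounts using the 2t-row, t/d-per-column pattern from the matrix construction, and decoding simply eliminates any member who is a friend of a fictional account that got a negative answer. All names are hypothetical.

```python
import random


def simulate_attack(n, target_friends, t, d, seed=0):
    """Return the candidate friend set for a target after 2t common-friend tests."""
    rng = random.Random(seed)
    rows = 2 * t
    per_member = max(1, t // d)

    # fake_friends[i] = real members that fictional account i connects to.
    fake_friends = [set() for _ in range(rows)]
    for member in range(n):
        for i in rng.sample(range(rows), per_member):
            fake_friends[i].add(member)

    # The leak: one bit per fictional account (shares a friend with the target?).
    results = [bool(fake_friends[i] & target_friends) for i in range(rows)]

    # A negative answer clears every member connected to that account.
    candidates = set(range(n))
    for i in range(rows):
        if not results[i]:
            candidates -= fake_friends[i]
    return candidates


true_friends = {3, 17, 42}  # simulated ground truth, hidden from the attacker
candidates = simulate_attack(n=100, target_friends=true_friends, t=60, d=3)
print(true_friends <= candidates)  # True: real friends are never eliminated
```

True friends always survive elimination; whether any non-friends also survive depends on the random choices and on how t compares to the true number of friends.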

Exploiting Sparse Data Sets Histogram of differences from R: Table of sizes, lengths, and differences from R:

Number of Tests Needed in Theory 1st column: To clone entire database with high probability. 2nd column: To clone sparsest 50% of database with high probability. 3rd column: To clone entire database with probability 1.

Different Choices for “d” Tradeoff: –The smaller the “d”, the faster we can recover sparse vectors. –With very small “d”, it can take a long time to recover the vectors that are not so sparse. But most vectors are sparse, so we generally want a fairly small “d”. Attack on a Netflix user who has rated 98 movies: with smaller “d”, the rate of convergence is faster.

Different choices for “d” Here we vary “d” on the x-axis and we plot the mean and median number of tests required across the vectors in the database.

Distance from R More tests are needed for vectors that are farther from the reference R (but note that most vectors are close to R). We also see the tradeoff among the various values of “d”.

Thresholding Behavior There are critical values of our estimated value for d:

Conclusion and Future Work We have presented a way to turn privacy leaks into floods, with a number of applications: –Social networks –DNA databases –Ratings vectors Future work: extend our approach to non-binary vectors (e.g., friends and foes)