1 Progress on the PORTIA Project JOAN FEIGENBAUM June 5, 2006; Google; New York NY.

Slides:



Advertisements
Similar presentations
Revisiting the efficiency of malicious two party computation David Woodruff MIT.
Advertisements

Active Learning for Streaming Networked Data Zhilin Yang, Jie Tang, Yutao Zhang Computer Science Department, Tsinghua University.
1 Education and Outreach in the PORTIA Project Joan Feigenbaum PORTIA Project Site Visit Stanford CA, May 12-13, 2005.
CS555Topic 241 Cryptography CS 555 Topic 24: Secure Function Evaluation.
Iowa State University Department of Computer Science Center for Computational Intelligence, Learning, and Discovery Harris T. Lin and Vasant Honavar. BigData2013.
Lakshmi Narayana Gupta Kollepara 10/26/2009 CSC-8320.
An Approach to Evaluate Data Trustworthiness Based on Data Provenance Department of Computer Science Purdue University.
Yan Huang, Jonathan Katz, David Evans University of Maryland, University of Virginia Efficient Secure Two-Party Computation Using Symmetric Cut-and-Choose.
Advanced Database Systems September 2013 Dr. Fatemeh Ahmadi-Abkenari 1.
UTEPComputer Science Dept.1 University of Texas at El Paso Privacy in Statistical Databases Dr. Luc Longpré Computer Science Department Spring 2006.
1 i206: Distributed Computing Applications & Infrastructure 2012
CSCE 715 Ankur Jain 11/16/2010. Introduction Design Goals Framework SDT Protocol Achievements of Goals Overhead of SDT Conclusion.
1 WRAP UP Joan Feigenbaum PORTIA Project Site Visit Stanford CA, May 12-13, 2005.
An architecture for Privacy Preserving Mining of Client Information Jaideep Vaidya Purdue University This is joint work with Murat.
CMSC 414 Computer and Network Security Lecture 21 Jonathan Katz.
1 The PORTIA Project: Research Overview Dan Boneh PORTIA Project Site Visit Stanford CA, May 12-13, 2005
1 On Compressing Web Graphs Michael Mitzenmacher, Harvard Micah Adler, Univ. of Massachusetts.
OS Fall ’ 02 Performance Evaluation Operating Systems Fall 2002.
1 Progress on the PORTIA Project JOAN FEIGENBAUM March 21, 2005; Rutgers.
Performance Evaluation
Malicious parties may employ (a) structure-based or (b) label-based attacks to re-identify users and thus learn sensitive information about their rating.
1 Introduction to Secure Computation Benny Pinkas HP Labs, Princeton.
Computing Sketches of Matrices Efficiently & (Privacy Preserving) Data Mining Petros Drineas Rensselaer Polytechnic Institute (joint.
OS Fall ’ 02 Performance Evaluation Operating Systems Fall 2002.
©Silberschatz, Korth and Sudarshan18.1Database System Concepts Centralized Systems Run on a single computer system and do not interact with other computer.
Lars Arge 1/12 Lars Arge. 2/12  Pervasive use of computers and sensors  Increased ability to acquire/store/process data → Massive data collected everywhere.
1 Approximate Privacy: Foundations and Quantification Joan Feigenbaum Northwest Univ.; May 20, 2009 Joint work with A.
Privacy Preserving Learning of Decision Trees Benny Pinkas HP Labs Joint work with Yehuda Lindell (done while at the Weizmann Institute)
Control of Personal Information in a Networked World Rebecca Wright Boaz Barak Jim Aspnes Avi Wigderson Sanjeev Arora David Goodman Joan Feigenbaum ToNC.
DATA PRESERVATION IN ALICE FEDERICO CARMINATI. MOTIVATION ALICE is a 150 M CHF investment by a large scientific community The ALICE data is unique and.
Software Development Unit 2 Databases What is a database? A collection of data organised in a manner that allows access, retrieval and use of that data.
Current Developments in Differential Privacy Salil Vadhan Center for Research on Computation & Society School of Engineering & Applied Sciences Harvard.
Secure Computation of Surveys Joan Feigenbaum Benny Pinkas Raphael S. Ryger Felipe Saint Jean Workshop on Secure Multiparty Protocols (SMP 2004)
Database System Concepts and Architecture Lecture # 3 22 June 2012 National University of Computer and Emerging Sciences.
Overview of Privacy Preserving Techniques.  This is a high-level summary of the state-of-the-art privacy preserving techniques and research areas  Focus.
1 Privacy-Preserving Distributed Information Sharing Nan Zhang and Wei Zhao Texas A&M University, USA.
Secure Computation of the k’th Ranked Element Gagan Aggarwal Stanford University Joint work with Nina Mishra and Benny Pinkas, HP Labs.
Purpose of study A high-quality computing education equips pupils to use computational thinking and creativity to understand and change the world. Computing.
ITEC224 Database Programming
1 Information Retrieval through Various Approximate Matrix Decompositions Kathryn Linehan Advisor: Dr. Dianne O’Leary.
Advanced Software Engineering PROJECT. 1. MapReduce Join (2 students)  Focused on performance analysis on different implementation of join processors.
Wireless Networks Breakout Session Summary September 21, 2012.
Privacy Preserving Data Sharing With Anonymous ID Assignment
Parallel and Distributed IR. 2 Papers on Parallel and Distributed IR Introduction Paper A: Inverted file partitioning schemes in Multiple Disk Systems.
Privacy-Aware Personalization for Mobile Advertising
RESOURCES, TRADE-OFFS, AND LIMITATIONS Group 5 8/27/2014.
Hiding in the Mobile Crowd: Location Privacy through Collaboration.
Secure two-party computation: a visual way by Paolo D’Arco and Roberto De Prisco.
1 Privacy Preserving Data Mining Haiqin Yang Extracted from a ppt “Secure Multiparty Computation and Privacy” Added “Privacy Preserving SVM”
Palette: Distributing Tables in Software-Defined Networks Yossi Kanizo (Technion, Israel) Joint work with Isaac Keslassy (Technion, Israel) and David Hay.
Major objective of this course is: Design and analysis of modern algorithms Different variants Accuracy Efficiency Comparing efficiencies Motivation thinking.
The Vesta Parallel File System Peter F. Corbett Dror G. Feithlson.
Privacy-preserving rule mining. Outline  A brief introduction to association rule mining  Privacy preserving rule mining Single party  Perturbation.
Welcome to RPI CS! Theory Group Professors: Mark Goldberg Associate Professors: Daniel Freedman, Mukkai Krishnamoorthy, Malik Magdon- Ismail, Bulent Yener.
Privacy Preserving Payments in Credit Networks By: Moreno-Sanchez et al from Saarland University Presented By: Cody Watson Some Slides Borrowed From NDSS’15.
Anonymity and Privacy Issues --- re-identification
Concept-based P2P Search How to find more relevant documents Ingmar Weber Max-Planck-Institute for Computer Science Joint work with Holger Bast Torino,
1 Privacy and Accountability: Introduction to Workshop Themes JOAN FEIGENBAUM June 28, 2006; Cambridge MA.
 Distributed Database Concepts  Parallel Vs Distributed Technology  Advantages  Additional Functions  Distribution Database Design  Data Fragmentation.
DMBS Internals I February 24 th, What Should a DBMS Do? Store large amounts of data Process queries efficiently Allow multiple users to access the.
Operating Systems CMPSC 473 Introduction and Overview August 24, Lecture 1 Instructor: Bhuvan Urgaonkar.
Privacy Preserving Outlier Detection using Locality Sensitive Hashing
Differential Privacy in Practice
Current Developments in Differential Privacy
Objective of This Course
MANAGING DATA RESOURCES
Introduction to Information Retrieval
Slalom: Fast, Verifiable and Private Execution of Neural Networks in Trusted Hardware Kriti shreshtha.
Helen: Maliciously Secure Coopetitive Learning for Linear Models
Presentation transcript:

1 Progress on the PORTIA Project JOAN FEIGENBAUM June 5, 2006; Google; New York NY

2 PORTIA: Privacy, Obligations, and Rights in Technologies of Information Assessment Large-ITR, five-year, multi- institutional, multi-disciplinary, multi-modal research project on sensitive information in a networked world

3 Motivation Sensitive Information: Info that can harm data subjects, data owners, or data users if it is mishandled. Not all of it is strictly “private.” There’s a lot more of it than there used to be! –Increased use of computers and networks –Increased processing power and algorithmic knowledge –Decreased storage costs  Google has “shifted the boundary between ‘self’ and ‘other’.” “Mishandling” can be very harmful. −ID theft −Loss of employment or insurance −“You already have zero privacy. Get over it.” (Scott McNealy, 1999)

4 PORTIA Goals Produce a next generation of technology for handling sensitive information that is qualitatively better than the current generation’s. Enable handling of sensitive information over the course of its lifetime. Formulate an effective conceptual framework for policy making and philosophical inquiry into the rights and responsibilities of data subjects, data owners, and data users.

5 Academic–CS Participants Stanford Dan Boneh Hector Garcia-Molina John Mitchell Rajeev Motwani Yale Joan Feigenbaum Ravi Kannan Avi Silberschatz Univ. of NM Stevens NYU Stephanie Forrest Rebecca Wright Helen Nissenbaum (“computational immunology”) (“value-sensitive design”)

6 Research Partners J. Balkin (Yale Law School) C. Dwork (Microsoft) S. Hawala (Census Bureau) B. LaMacchia (Microsoft) K. McCurley (Google) P. Miller (Yale Medical School) J. Morris (CDT) T. Pageler (Secret Service) B. Pinkas (Hewlett Packard) M. Rotenberg (EPIC) A. Schäffer (NIH) D. Schutzer (CitiGroup) Note participation by the software industry, key user communities, advocacy organizations, and non-CS academics.

7 PORTIA Work at Yale includes: Massive-Data-Set Algorithmics (K. Chang, P. Drineas, N. Feamster, J. Feigenbaum, R. Kannan, S. Kannan, M. Mahoney, A. McGregor, J. Rexford, S. Suri, J. Zhang) Database Engines for Biosciences (J. Corwin, P. Miller, A. Silberschatz, S. Yadlapalli) Privacy-Preserving Data Mining (D. Boneh, A. Bortz, J. Feigenbaum, O. Kardes, B. Pinkas, R. Ryger, F. Saint-Jean, R. Wright, Z. Yang) Legal Foundations (J. Balkin, J. Feigenbaum, M. Godwin, E. Katz, N. Kozlovski)

8 Comput’l. Model for Massive Data Streaming/Pass-Efficient Model: The input consists of an array of data. The order may be adversarial. Data may only be accessed in a few, sequential passes over the data. The algorithm is allowed a small amount of working memory to perform calculations. Goals: Small working space and few passes Working Space STREAM

9 Motivation for the Streaming Model The amount of data arriving “online” is so overwhelming that one cannot store it all. Thus, it makes sense to consider what can be accomplished in one pass over a datastream. I/O to secondary (disk) or tertiary (DVDs, tapes) storage dominates compute time. Disks have heavy penalties for random access (seek times and rotational delay). Thus, it makes sense to optimize throughput for sequential access. Conceptually, it makes sense to distinguish between memory (small but supports fast random access) and storage devices (large but does not support fast random access).

10 Stream Algorithms for Massive Graphs (Feigenbaum, S. Kannan, McGregor, Suri, and Zhang) A graph with n nodes and m edges is presented as a stream of edges. Very little can be done when the algorithms are limited to o(n) (working) space. Algorithms using n polylog(n) space for, e.g.,: –Approximate matching –Approximate all-pairs shortest-path distances Some massive-graph problems require multiple passes in the streaming model.

11 Approx. Massive-Matrix Computations (Drineas, R. Kannan, and Mahoney) Approximate by sampling the rows and the columns of the matrices. Goals are fast running time and few passes over the matrices. Few-pass algorithms for, e.g.,: –Approximate matrix multiplication –Computing a low-rank approximation of a matrix –Approximating a compressed matrix decomposition

12 Pass-Efficient Learning of Distribs. (Chang and R. Kannan) The input is a stream of samples from a distribution F, ordered arbitrarily. The goal is to reconstruct F. For certain classes of distributions: F can be learned with error ε in P passes using space O(k 3 /ε 2/P ), where k is a parameter describing the “complexity” of the distribution. Any randomized P-pass algorithm will need at least 1/ε 1/(2P-1) bits of memory to learn a distribution with 3 “steps.”

13 PORTIA Work at Yale includes: Massive-Data-Set Algorithmics (K. Chang, P. Drineas, N. Feamster, J. Feigenbaum, R. Kannan, S. Kannan, M. Mahoney, A. McGregor, J. Rexford, S. Suri, J. Zhang) Database Engines for Biosciences (J. Corwin, P. Miller, A. Silberschatz, S. Yadlapalli) Privacy-Preserving Data Mining (D. Boneh, A. Bortz, J. Feigenbaum, O. Kardes, B. Pinkas, R. Ryger, F. Saint-Jean, R. Wright, Z. Yang) Legal Foundations (J. Balkin, J. Feigenbaum, M. Godwin, E. Katz, N. Kozlovski)

14 Database Engines for Biosciences (Corwin, Miller, Silberschatz, Yadlapalli) Biological and medical datasets frequently are not handled well by conventional relational- database systems. –Heterogeneity (at the schema level and at the data level) –Sparse data The PORTIA goal is to add capabilities to a relational-database system to better support these types of data.

15 Novel Tools for Access and Queries A rule-based system that creates a homogeneous view by mapping among heterogeneous schemata A postgreSQL-based query engine tailored for sparse biosciences data. (Key idea: Store data attributes in a collection of one-column tables, joined by their row IDs. The row ID becomes the primary key of the attribute table.)

16 PORTIA Work at Yale includes: Massive-Data-Set Algorithmics (K. Chang, P. Drineas, N. Feamster, J. Feigenbaum, R. Kannan, S. Kannan, M. Mahoney, A. McGregor, J. Rexford, S. Suri, J. Zhang) Database Engines for Biosciences (J. Corwin, P. Miller, A. Silberschatz, S. Yadlapalli) Privacy-Preserving Data Mining (D. Boneh, A. Bortz, J. Feigenbaum, O. Kardes, B. Pinkas, R. Ryger, F. Saint-Jean, R. Wright, Z. Yang) Legal Foundations (J. Balkin, J. Feigenbaum, M. Godwin, E. Katz, N. Kozlovski)

17 Privacy-preserving Data Mining Is this an oxymoron? No! Cryptographic theory is extraordinarily powerful, almost paradoxically so. Computing exactly one relevant fact about a distributed data set while concealing everything else is exactly what cryptographic theory enables in principle. But not (yet!) in practice.

18 Secure, Multiparty Function Evaluation... x1x1 x2x2 x 3x 3 x n-1 x nx n y = F (x 1, …, x n ) Each i learns y. No i can learn anything about x j (except what he can infer from x i and y ). Very general positive results. Not very efficient.

19 PORTIA-Supported Work at Yale on SMFE Protocols Wright and Yang: Efficient 2-party protocol for K2 Bayes-net construction on vertically partitioned data Kardes, Ryger, Wright, and Feigenbaum: Experimental evaluation of the Wright- Yang protocol; “coordination framework” for SMFE-protocol implementation Boneh, Bortz, Feigenbaum, and Saint- Jean: Private-Information-Retrieval library with MySQL support

20 Secure Computation of Surveys Joan Feigenbaum (Yale), B. Pinkas (HP), R. Ryger (Yale), and F. Saint-Jean (Yale) ppt} crypto.stanford.edu/portia/software/taulbee.html

21 Surveys and other Naturally Centralized Multiparty Computations Consider –Sealed-bid auctions –Elections –Referenda –Surveys Each participant weighs the hoped-for payoffs against any revelation penalty (“loss of privacy”) and is concerned that the computation be fault-free and honest. The implementor, in control of the central computation, must configure auxiliary payoffs and privacy assurances to encourage (honest) participation.

22 CRA Taulbee Survey: Computer Science Faculty Salaries Computer science departments in four tiers, all the rest Academic faculty in four ranks: full, associate, and assistant professors, and non-tenure-track teaching faculty Intention: Convey salary distribution statistics per tier-rank to the community at large without revealing department-specific information.

23 CRA Taulbee Survey: The Current Computation Inputs, per department and faculty rank: –Minimum –Maximum –Median –Mean Outputs, per tier and faculty rank: –Minimum, maximum, and mean of department minima –Minimum, maximum, and mean of department maxima –Median of department means (not weighted) –Mean (weighted mean of department means)

24 Taulbee Survey: The Problem CRA wishes to provide fuller statistics than the meager data currently collected can support. The current level of data collection already compromises department-specific information. Asking for submission of full faculty-salary information greatly raises the threshold for trust in CRA's intentions and its security competence. Detailed disclosure, even if anonymized, may be explicitly prohibited by the school. Hence, there is a danger of significant non- participation in the Taulbee Survey.

25 Communication Pattern: General SMFE Protocols

26 Communication Pattern: Surveys and Other Trusted-Party Computations

27 Communication Pattern: M-for-N-Party SMFE

28 Our Implementation: Input-Collection Phase Secure input collection: –Salary and rank data entry by department representative –Per rank, in JavaScript, computation of XOR shares of the individual salaries for the two (M = 2 ) computation servers –Per rank, HTTPS transmission of XOR shares to their respective computation servers Note that cleartext data never leave the client machine.

29 Our Implementation: Computation Phase Per tier and rank, construction of a Boolean circuit to –reconstruct inputs by XOR-ing their shares –sort the inputs in an odd-even sorting network Secure computation, per tier and rank: –Fairplay [Malkhi et al., 2004] implementation of the Yao 2-party SFE protocol for the constructed circuit and the collected input shares –Output is a sorted list of all salaries in the tier-rank. Postprocessing, per tier and rank: –arbitrary, statistical computation on the sorted, cross- departmental salary list

30 The Heartbreak of Cryptography User-friendly, open-source, free implementation NO ADOPTION CRA’s reasons  Need for data cleaning and multiyear comparisons –Perhaps most member departments will trust us. Yale Provost’s Office’s reasons  No legal basis for using this privacy-preserving protocol on data that we otherwise don’t disclose  Correctness and security claims are hard and expensive to assess, despite open-source implementation.  All-or-none adoption by Ivy+ peer group.

31 See PORTIA Website for: Papers, talks, and software Educational activities –Courses –Grad students and postdocs Media coverage Programs and slides from workshops Related links [ Google “PORTIA project” ]

32 Preliminary Conclusions Less and less sensitive information is truly inaccessible. The question is the cost of access, and that cost is decreasing. Foundational legal theories to support obligations and rights in cyberspace are lacking. Technological progress is still going strong, almost 30 years after Diffie-Hellman, but adoption is slow. ?Next step: Find a community of data owners who need the results of joint computations and can’t get them without SMFE. (Medical researchers?)

33 Past Collaboration with Google Dan Boneh, Collin Jackson, and others at Stanford Browser-based, client-side identity protection See PORTIA software page, or go directly to – –

34 Potential for Further Joint Work Advertiser-supported, privacy-preserving web browsing Personalization/efficiency vs. privacy Massive-data-set algorithmics (including massive-graph algorithmics) Algorithmic mechanism design to combat “pollution of the information environment” Leveraging the search-auction infrastructure through “protocol-based algorithm design”