
Privacy and Fault-Tolerance in Distributed Optimization Nitin Vaidya University of Illinois at Urbana-Champaign

Acknowledgements: Shripad Gade, Lili Su

Goal: minimize the total cost Σi fi(x)

Applications: fi(x) = cost for robot i to go to location x; minimize the total cost of rendezvous, Σi fi(x). Other examples: resource allocation (similar to the rendezvous problem) and large-scale distributed machine learning, where data are generated at different locations.

Applications: learning across agents with local costs f1(x), …, f4(x); minimize the total cost Σi fi(x).

Outline: Distributed Optimization, Privacy, Fault-Tolerance

Distributed Optimization: either a central server with clients f1, f2, f3, or a peer-to-peer network of agents f1, …, f5.

Client-Server Architecture: clients with local costs f1(x), f2(x), f3(x), … communicate with a central server.

Client-Server Architecture: The server maintains an estimate xk; client i knows fi(x). In iteration k+1, client i downloads xk from the server and uploads its gradient ∇fi(xk); the server then updates xk+1 ← xk − αk Σi ∇fi(xk).
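
To make the loop concrete, here is a minimal single-process sketch of this client-server iteration in Python. The quadratic costs fi(x) = 0.5·||x − ci||², the step size, and the iteration count are illustrative assumptions of mine, not part of the slides.

```python
import numpy as np

# Illustrative local costs: f_i(x) = 0.5 * ||x - c_i||^2, so grad f_i(x) = x - c_i.
centers = [np.array([1.0, 0.0]), np.array([0.0, 2.0]), np.array([3.0, 1.0])]

def grad_f(i, x):
    """Gradient of client i's (assumed) quadratic cost at x."""
    return x - centers[i]

x = np.zeros(2)                # server's estimate x_k
alpha = 0.1                    # step size alpha_k (held constant here)
for k in range(200):
    # Each client downloads x_k and uploads its gradient.
    grads = [grad_f(i, x) for i in range(len(centers))]
    # Server update: x_{k+1} = x_k - alpha_k * sum_i grad f_i(x_k).
    x = x - alpha * sum(grads)

print(x)  # approaches the minimizer of sum_i f_i, i.e. the mean of the centers
```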

Variations: stochastic, asynchronous, …

Peer-to-Peer Architecture: agents with local costs f1(x), f2(x), … connected in a network, with no central server.

Peer-to-Peer Architecture: Each agent maintains a local estimate x. In each iteration: a consensus step with neighbors, then apply own gradient to own estimate: xk+1 ← xk − αk ∇fi(xk).
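
Below is a minimal sketch of this peer-to-peer iteration (consensus with neighbors, then a local gradient step). The five-agent ring, the doubly stochastic weight matrix, and the quadratic costs are assumptions chosen for illustration.

```python
import numpy as np

n = 5
centers = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # assumed costs f_i(x) = 0.5*(x - c_i)^2

# Doubly stochastic weights for a ring: each agent averages itself and its two neighbors.
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 1/3
    W[i, (i - 1) % n] = 1/3
    W[i, (i + 1) % n] = 1/3

x = np.zeros(n)                     # local estimates, one per agent
for k in range(1, 501):
    alpha = 1.0 / k                 # diminishing step size alpha_k
    x = W @ x                       # consensus step with neighbors
    x = x - alpha * (x - centers)   # each agent applies its own gradient

print(x)  # all local estimates approach the global minimizer, mean(centers) = 2.0
```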

Outline: Distributed Optimization, Privacy, Fault-Tolerance

The server observes the gradients ∇fi(xk) uploaded by clients f1, f2, f3, so privacy is compromised. Goal: achieve privacy and yet collaboratively optimize.

Related Work: cryptographic methods (homomorphic encryption), function transformation, differential privacy.

Differential Privacy: each client uploads a noisy gradient ∇fi(xk) + εk; this trades off privacy with accuracy.
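
A sketch of what a differentially private gradient upload might look like. The Gaussian mechanism, the clipping threshold, and the noise scale below are common choices that I am assuming for illustration; the slides only state that noise εk is added and that privacy trades off against accuracy.

```python
import numpy as np

rng = np.random.default_rng(0)

def private_gradient(grad, clip=1.0, sigma=0.5):
    """Clip the gradient and add Gaussian noise eps_k (assumed mechanism)."""
    norm = np.linalg.norm(grad)
    if norm > clip:
        grad = grad * (clip / norm)       # bound the sensitivity
    return grad + rng.normal(scale=sigma, size=grad.shape)

# Client i uploads private_gradient(true_grad) instead of true_grad;
# larger sigma means stronger privacy but a noisier, less accurate update.
print(private_gradient(np.array([3.0, 4.0])))
```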

Proposed Approach: motivated by secret sharing; exploit diversity (multiple servers / neighbors).

Proposed Approach: clients f1, f2, f3 connect to Server 1 and Server 2; privacy holds if only a subset of the servers is adversarial.

Proposed Approach (peer-to-peer, agents f1, …, f5): privacy holds if only a subset of the neighbors is adversarial.

Proposed Approach: structured noise that “cancels” over servers/neighbors.

Intuition: Each client simulates multiple clients. With Server 1 and Server 2 maintaining estimates x1 and x2, client 1 splits its cost as f11(x) + f12(x) = f1(x) and assigns f11 to Server 1 and f12 to Server 2 (similarly f21, f22, f31, f32). The pieces fij(x) are not necessarily convex.
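
A toy sketch (my own construction, not from the slides) of such a split: a convex quadratic f1 is written as f11 + f12, each piece is non-convex on its own, yet the gradients of the pieces still sum to the gradient of f1.

```python
import numpy as np

def f1(x):          # client 1's true (convex) cost
    return 0.5 * x**2

def f11(x):         # piece sent to Server 1: non-convex on its own
    return 0.5 * x**2 + 5.0 * np.sin(x)

def f12(x):         # piece sent to Server 2: also non-convex
    return -5.0 * np.sin(x)

def grad(f, x, h=1e-6):
    """Numerical gradient, good enough for this illustration."""
    return (f(x + h) - f(x - h)) / (2 * h)

x = 1.3
# Each piece reveals little about f1 on its own, but their gradients
# still add up to the gradient of f1.
print(grad(f11, x) + grad(f12, x), grad(f1, x))   # approximately equal
```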

Algorithm: Each server maintains an estimate. In each iteration, client i downloads the estimate from the corresponding server and uploads the gradient of fi; each server updates its estimate using the received gradients. Servers periodically exchange estimates to perform a consensus step.

Claim: Under suitable assumptions, the server estimates eventually reach consensus in argmin_x Σi fi(x).

Privacy: Server 1 holds f11 + f21 + f31 and Server 2 holds f12 + f22 + f32. Server 1 may learn f11, f21, f31 and the aggregate f12 + f22 + f32, but this is not sufficient to learn any fi.

Function splitting (f11(x) + f12(x) = f1(x)) is not necessarily practical; structured randomization is an alternative.

Structured Randomization: multiplicative or additive noise in the gradients; the noise cancels over the servers.

Multiplicative Noise: Server 1 and Server 2 maintain estimates x1 and x2. Client 1 uploads α∇f1(x1) to Server 1 and β∇f1(x2) to Server 2, with α + β = 1. It suffices for this invariant to hold over a larger number of iterations. The noise from client i to server j is not zero-mean.
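
A minimal sketch of this multiplicative-noise scheme under simplifying assumptions of my own (quadratic costs, a fresh α drawn per client per iteration, and a consensus step between the two servers every round): each server only ever sees scaled gradients, but because α + β = 1 the combined update is exact.

```python
import numpy as np

rng = np.random.default_rng(1)
centers = [np.array([1.0, 0.0]), np.array([0.0, 2.0]), np.array([3.0, 1.0])]

def grad_f(i, x):                      # assumed quadratic costs: f_i(x) = 0.5*||x - c_i||^2
    return x - centers[i]

x1 = np.zeros(2)                       # Server 1's estimate
x2 = np.zeros(2)                       # Server 2's estimate
alpha_step = 0.1                       # gradient step size

for k in range(300):
    g1 = np.zeros(2)
    g2 = np.zeros(2)
    for i in range(len(centers)):
        a = rng.uniform(-2.0, 3.0)     # alpha: not zero-mean, redrawn every round
        b = 1.0 - a                    # beta: the invariant alpha + beta = 1
        g1 += a * grad_f(i, x1)        # Server 1 only ever sees scaled gradients
        g2 += b * grad_f(i, x2)
    x1 = x1 - alpha_step * g1
    x2 = x2 - alpha_step * g2
    x1 = x2 = 0.5 * (x1 + x2)          # consensus step between the servers

print(x1)  # approaches the minimizer of sum_i f_i (the mean of the centers)
```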

Claim: Under suitable assumptions, the server estimates eventually reach consensus in argmin_x Σi fi(x).

Peer-to-Peer Architecture (agents f1, …, f5).

Reminder: Each agent maintains a local estimate x, performs a consensus step with its neighbors, and applies its own gradient to its own estimate: xk+1 ← xk − αk ∇fi(xk).

Proposed Approach: Each agent shares a noisy estimate with its neighbors. Scheme 1: the noise cancels over neighbors. Scheme 2: the noise cancels network-wide. For example, an agent sends x + ε1 to one neighbor and x + ε2 to another, with ε1 + ε2 = 0 (over iterations).

Peer-to-Peer Architecture: poster today (Shripad Gade).

Outline: Distributed Optimization, Privacy, Fault-Tolerance

Fault-Tolerance: Some agents may be faulty; we need to produce “correct” output despite the faults.

Byzantine Fault Model: No constraint on the misbehavior of a faulty agent; it may send bogus messages, and faulty agents can collude.

Peer-to-Peer Architecture: fi(x) = cost for robot i to go to location x; a faulty agent may choose an arbitrary cost function.

Peer-to-Peer Architecture (agents f1, …, f5).

Client-Server Architecture: clients f1, f2, f3 upload gradients ∇fi(xk) to the server.

Fault-Tolerant Optimization: The original problem, minimize Σi fi(x), is no longer meaningful. Instead, optimize the cost over only the non-faulty agents: minimize Σi∈good fi(x). Impossible!

Fault-Tolerant Optimization: Optimize a weighted cost over only the non-faulty agents, Σi∈good αi fi(x), with each αi as close to 1/|good| as possible.

Fault-Tolerant Optimization: Optimize the weighted cost Σi∈good αi fi(x). With t Byzantine faulty agents, t of the weights may be 0.

Fault-Tolerant Optimization: Optimize the weighted cost Σi∈good αi fi(x). With t Byzantine agents out of n total, at least n − 2t of the weights are guaranteed to be > 1/(2(n − t)).

Centralized Algorithm: Of the n agents, any t may be faulty. How to filter out the cost functions of the faulty agents?

Centralized Algorithm (scalar argument x): Define a virtual function G(x) whose gradient is obtained as follows. At a given x, sort the gradients of the n local cost functions, discard the smallest t and the largest t gradients, and take the mean of the remaining gradients as the gradient of G at x. The virtual function G(x) is convex, so it can be optimized easily.
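
A sketch of this trimmed-mean gradient filter for scalar x, with one illustrative Byzantine agent; the quadratic local costs and the adversarial gradient value are my own assumptions.

```python
import numpy as np

centers = np.array([0.0, 1.0, 2.0, 3.0, 4.0])    # assumed good costs f_i(x) = 0.5*(x - c_i)^2
t = 1                                            # up to t Byzantine agents

def gradient_of_G(x):
    """Trimmed mean: sort the reported gradients, drop the smallest t and largest t, average the rest."""
    grads = list(x - centers)                    # gradients reported by the good agents
    grads.append(1e6)                            # a Byzantine agent reports garbage
    grads.sort()
    kept = grads[t:len(grads) - t]               # discard t smallest and t largest
    return sum(kept) / len(kept)

x = 10.0
for k in range(500):
    x -= 0.1 * gradient_of_G(x)                  # plain gradient descent on the convex G

print(x)  # lands near the minimizer of a weighted sum of the good costs
```

Note how the filter effectively zeroes out some good agents' weights (here the agent with center 4), matching the earlier statement that only a weighted cost with some weights possibly 0 can be optimized.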

Peer-to-Peer Fault-Tolerant Optimization: Gradient filtering similar to the centralized algorithm; requires “rich enough” connectivity; correlation between the functions helps. The vector case is harder; redundancy between the functions helps.

Summary: Distributed Optimization, Privacy, Fault-Tolerance

Thanks! disc.ece.illinois.edu

Distributed Peer-to-Peer Optimization: Each agent maintains a local estimate x. In each iteration, compute a weighted average with the neighbors' estimates, then apply own gradient to own estimate: xk+1 ← xk − αk ∇fi(xk). The local estimates converge to argmin_x Σi fi(x).

RSS – Locally Balanced Perturbations: Perturbations add to zero (locally, per node) and are bounded (≤ Δ). Algorithm: node j selects d_k^{j,i} such that Σi d_k^{j,i} = 0 and |d_k^{j,i}| ≤ Δ, shares w_k^{j,i} = x_k^j + d_k^{j,i} with node i, then performs consensus and (stochastic) gradient descent.
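
A sketch of the locally balanced perturbations for one node j in one iteration, under simplifications of my own (scalar state, uniform noise): the perturbations d_k^{j,i} sum to zero across node j's neighbors, so the average of what the neighbors receive is unchanged.

```python
import numpy as np

rng = np.random.default_rng(3)

def locally_balanced_shares(x_j, neighbors, delta=0.5):
    """Node j's obfuscated messages w_k^{j,i} = x_k^j + d_k^{j,i}, with sum_i d_k^{j,i} = 0."""
    m = len(neighbors)
    d = rng.uniform(-delta, delta, size=m)
    d -= d.mean()          # enforce the zero-sum constraint (centering loosens the bound slightly)
    return {nbr: x_j + di for nbr, di in zip(neighbors, d)}

x_j = 1.7                  # node j's current estimate x_k^j
msgs = locally_balanced_shares(x_j, neighbors=["a", "b", "c"])
print(msgs, sum(msgs.values()) / 3)   # each neighbor sees noise; the average equals x_j
```

Consensus and (stochastic) gradient descent then proceed on the obfuscated messages.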

RSS – Network Balanced Perturbations: Perturbations add to zero (over the network) and are bounded (≤ Δ). Algorithm: node j computes its perturbation d_k^j by sending a share s^{j,i} to each neighbor i, adding the received s^{i,j} and subtracting the sent s^{j,i}, so d_k^j = received − sent. It shares the obfuscated state w_k^j = x_k^j + d_k^j with its neighbors, then performs consensus and (stochastic) gradient descent.
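
A sketch of the network-balanced scheme on an assumed three-node ring with scalar states: each node sends a random share s^{j,i} to each neighbor and sets d_k^j = received − sent, so the perturbations cancel over the whole network.

```python
import numpy as np

rng = np.random.default_rng(4)

nodes = [0, 1, 2]
edges = [(0, 1), (1, 2), (2, 0)]           # assumed communication graph (a triangle)
x = {0: 1.0, 1: 2.0, 2: 3.0}               # current local estimates x_k^j

# Each node j sends a bounded random share s[(j, i)] to each neighbor i.
s = {}
for (j, i) in edges + [(i, j) for (j, i) in edges]:
    s[(j, i)] = rng.uniform(-0.5, 0.5)

# Perturbation of node j: d_k^j = sum of received shares - sum of sent shares.
d = {}
for j in nodes:
    received = sum(s[(i, j)] for i in nodes if (i, j) in s)
    sent = sum(s[(j, i)] for i in nodes if (j, i) in s)
    d[j] = received - sent

w = {j: x[j] + d[j] for j in nodes}        # obfuscated states shared with neighbors

print(sum(d.values()))                     # ~0: perturbations cancel over the network
print(sum(w.values()), sum(x.values()))    # network-wide sums agree
```

Consensus and (stochastic) gradient descent then run on the obfuscated states w_k^j.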

Convergence: Let x̄_T^j = (Σ_{k≤T} α_k x_k^j) / (Σ_{k≤T} α_k) with α_k = 1/√k. Then f(x̄_T^j) − f(x*) ≤ O(log T / √T) + O(Δ² log T / √T). This gives asymptotic convergence of the iterates to the optimum and a privacy-convergence trade-off; stochastic gradient updates work too.

Function Sharing: Let the fi(x) be bounded-degree polynomials. Algorithm: node j shares a polynomial s^{j,i}(x) with node i, obfuscates using p_j(x) = Σi s^{i,j}(x) − Σi s^{j,i}(x), and then runs distributed gradient descent on f̂_j(x) = f_j(x) + p_j(x).
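
A toy sketch of function sharing with degree-2 polynomials on a triangle of nodes; the graph, coefficients, and share distributions are my own illustrative choices. Each node's obfuscated cost f̂_j = f_j + p_j differs from f_j, yet Σj f̂_j = Σj f_j, so distributed gradient descent on the f̂_j optimizes the same objective.

```python
import numpy as np

rng = np.random.default_rng(5)

nodes = [0, 1, 2]
edges = [(0, 1), (1, 2), (2, 0)]          # assumed graph (vertex connectivity 2)

# Local costs as degree-2 polynomial coefficients [c0, c1, c2]: f_j(x) = c0 + c1*x + c2*x^2.
f = {0: np.array([0.0, -2.0, 1.0]),       # (x - 1)^2 - 1
     1: np.array([0.0,  0.0, 1.0]),       # x^2
     2: np.array([4.0, -4.0, 1.0])}       # (x - 2)^2

# Each node sends a random polynomial share s^{j,i}(x) to each neighbor.
s = {}
for (j, i) in edges + [(i, j) for (j, i) in edges]:
    s[(j, i)] = rng.uniform(-1.0, 1.0, size=3)

# p_j(x) = sum of received shares - sum of sent shares; obfuscated cost f_hat_j = f_j + p_j.
f_hat = {}
for j in nodes:
    received = sum(s[(i, j)] for i in nodes if (i, j) in s)
    sent = sum(s[(j, i)] for i in nodes if (j, i) in s)
    f_hat[j] = f[j] + received - sent

total = sum(f.values())
total_hat = sum(f_hat.values())
print(total, total_hat)    # equal up to rounding: the p_j cancel, so sum_j f_hat_j = sum_j f_j = f
```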

Function Sharing – Convergence: The function-sharing iterates converge to the correct optimum (where Σi fi(x) = f(x)). Privacy: if the vertex connectivity of the graph is ≥ f, then no group of f nodes can estimate the true functions fi (or any good subset of them). If p_j(x) is similar in form to f_j(x), it can hide fi(x) well.