Distributed Machine Learning: Communication, Efficiency, and Privacy Avrim Blum [RaviKannan60] Joint work with Maria-Florina Balcan, Shai Fine, and Yishay Mansour.

Presentation transcript:

Distributed Machine Learning: Communication, Efficiency, and Privacy Avrim Blum [RaviKannan60] Joint work with Maria-Florina Balcan, Shai Fine, and Yishay Mansour Carnegie Mellon University

And thank you for many enjoyable years working together on challenging problems where machine learning meets high-dimensional geometry.

This talk Algorithms for machine learning in a distributed, cloud-computing context. Related to Ravi's interest in algorithms for cloud computing. For full details see [Balcan-B-Fine-Mansour COLT'12]

Machine Learning What is Machine Learning about? –Making useful, accurate generalizations or predictions from data. –Given access to sample of some population, classified in some way, want to learn some rule that will have high accuracy over population as a whole. Typical ML problems: Given sample of images, classified as male or female, learn a rule to classify new images.

Machine Learning What is Machine Learning about? –Making useful, accurate generalizations or predictions from data. –Given access to sample of some population, classified in some way, want to learn some rule that will have high accuracy over population as a whole. Typical ML problems: Given set of protein sequences, labeled by function, learn rule to predict functions of new proteins.

Distributed Learning Many ML problems today involve massive amounts of data distributed across multiple locations.

Distributed Learning Many ML problems today involve massive amounts of data distributed across multiple locations. Click data

Distributed Learning Many ML problems today involve massive amounts of data distributed across multiple locations. Customer data

Distributed Learning Many ML problems today involve massive amounts of data distributed across multiple locations. Scientific data

Distributed Learning Many ML problems today involve massive amounts of data distributed across multiple locations. Each has only a piece of the overall data pie

Distributed Learning Many ML problems today involve massive amounts of data distributed across multiple locations. In order to learn over the combined D, holders will need to communicate.

Distributed Learning Many ML problems today involve massive amounts of data distributed across multiple locations. Classic ML question: how much data is needed to learn a given type of function well?

Distributed Learning Many ML problems today involve massive amounts of data distributed across multiple locations. These settings bring up a new question: how much communication? Plus issues like privacy, etc. That is the focus of this talk.

Distributed Learning: Scenarios Two natural high-level scenarios: 1. Each location has data from the same distribution. –So each could in principle learn on its own. –But want to use limited communication to speed up – ideally to the centralized learning rate. [Dekel, Gilad-Bachrach, Shamir, Xiao] 2. Overall distribution arbitrarily partitioned. –Learning without communication is impossible. –This will be our focus here.

The distributed PAC learning model Goal is to learn unknown function f ∈ C given labeled data from some prob. distribution D. However, D is arbitrarily partitioned among k entities (players) 1,2,…,k. [k=2 is interesting]

The distributed PAC learning model Goal is to learn unknown function f ∈ C given labeled data from some prob. distribution D. However, D is arbitrarily partitioned among k entities (players) 1,2,…,k. [k=2 is interesting] Players can sample (x, f(x)) from their own D_i, where D = (D_1 + D_2 + … + D_k)/k.

The distributed PAC learning model Goal is to learn unknown function f ∈ C given labeled data from some prob. distribution D. However, D is arbitrarily partitioned among k entities (players) 1,2,…,k. [k=2 is interesting] Players can sample (x, f(x)) from their own D_i. Goal: learn a good rule over the combined D.

The distributed PAC learning model Interesting special case to think about: –k=2. –One has the positives and one has the negatives. –How much communication to learn, e.g., a good linear separator? –In general, view k as small compared to sample size needed for learning.

The distributed PAC learning model Some simple baselines. Baseline #1: based on the fact that any class of VC-dim d can be learned to error ε from O(d/ε log 1/ε) samples. –Each player sends a 1/k fraction of this to player 1. –Player 1 finds a good rule h over the sample. Sends h to others. –Total: 1 round, O(d/ε log 1/ε) examples sent.
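To make the baseline concrete, here is a minimal sketch in Python. The `Player.sample(m)` method and the generic `learn` routine (any batch/ERM learner for the class) are assumed interfaces for illustration only, and constants in the sample-size bound are elided.

```python
import math

def baseline_centralize(players, learn, d, eps):
    """Baseline #1: pool O(d/eps * log(1/eps)) labeled examples at player 1.

    `players` is a list of k objects with a .sample(m) method returning m
    labeled pairs (x, y) drawn from that player's local distribution D_i;
    `learn` is any batch learner (ERM) for the class.  Both are assumed
    interfaces for this sketch, not part of the original protocol text.
    """
    k = len(players)
    m_total = math.ceil((d / eps) * math.log(1.0 / eps))  # O(d/eps log 1/eps), constants elided
    per_player = math.ceil(m_total / k)                   # each player sends a 1/k fraction

    pooled = []
    for p in players:                 # one round: everyone ships data to player 1
        pooled.extend(p.sample(per_player))

    h = learn(pooled)                 # player 1 fits a rule on the pooled sample
    return h                          # ...and broadcasts h to the other players
```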

The distributed PAC learning model Some simple baselines. Baseline #2: Suppose function class has an online algorithm A with mistake-bound M. E.g., Perceptron algorithm learns linear separators of margin γ with mistake-bound O(1/γ²).

The distributed PAC learning model Some simple baselines. Baseline #2: Suppose function class has an online algorithm A with mistake-bound M. –Player 1 runs A, broadcasts current hypothesis. –If any player has a counterexample, sends to player 1. Player 1 updates, re-broadcasts. –At most M examples and rules communicated.
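A sketch of this counterexample-driven protocol with the Perceptron as the online algorithm A. The `Player.find_counterexample(w)` method is an assumed interface returning a locally misclassified (x, y) pair or None; the loop terminates after at most M updates when the data really are separable with the stated margin.

```python
import numpy as np

def baseline_mistake_bound(players, d, max_updates=10_000):
    """Baseline #2 with Perceptron as the online algorithm A.

    Player 1 maintains the hypothesis (a weight vector w) and broadcasts it;
    any player holding a counterexample sends it back, player 1 updates and
    re-broadcasts.  Communication is at most M examples and M rules, where M
    is the mistake bound, e.g. O(1/gamma^2) for margin gamma.
    """
    w = np.zeros(d)                          # current hypothesis, h(x) = sign(w.x)
    for _ in range(max_updates):             # at most M iterations if the mistake bound is M
        ce = None
        for p in players:                    # broadcast w; look for a counterexample
            ce = p.find_counterexample(w)    # assumed interface: (x, y) with y*(w.x) <= 0, or None
            if ce is not None:
                break
        if ce is None:                       # nobody can falsify the hypothesis: done
            return w
        x, y = ce
        w = w + y * x                        # Perceptron update at player 1
    return w
```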

Dependence on 1/ε So far: linear dependence on d and 1/ε (Baseline #1), or on M with no dependence on 1/ε (Baseline #2). [ε = final error rate] Can you get O(d log 1/ε) examples of communication? Yes: distributed boosting.

Distributed Boosting Idea: Run baseline #1 for ε = 1/4. [everyone sends a small amount of data to player 1, enough to learn to error 1/4] Get initial rule h_1, send to others.

Distributed Boosting Idea: Players then reweight their D_i to focus on regions where h_1 did poorly. Repeat. This is a distributed implementation of the Adaboost algorithm. Some additional low-order communication is needed too (players send their current performance level to player 1, so it can request more data from players where h is doing badly). Key point: each round uses only O(d) samples and lowers the error multiplicatively.

Distributed Boosting Final result: O(d) examples of communication per round + low-order extra bits. O(log 1/ε) rounds of communication. So, O(d log 1/ε) examples of communication in total plus low-order extra info.
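Below is a highly simplified sketch of the flavor of this protocol, not the exact COLT'12 algorithm: each round, player 1 gathers only about O(d) examples, drawn from players in proportion to the boosting weight their local data currently carries, fits a weak rule, and everyone reweights AdaBoost-style. The `total_boosting_weight`, `sample_reweighted`, `weighted_error`, and `reweight` methods are assumed interfaces standing in for the low-order bookkeeping messages; constants are elided.

```python
import math

def distributed_boosting(players, weak_learn, d, eps):
    """Simplified sketch of distributed boosting (not the exact COLT'12 protocol).

    Each round: player 1 gathers roughly O(d) examples, drawn from players in
    proportion to the boosting weight of their reweighted local data, fits a
    weak rule, and broadcasts it.  AdaBoost-style reweighting lowers the error
    multiplicatively, so O(log 1/eps) rounds suffice.  `weak_learn` returns a
    callable hypothesis x -> {-1, +1}; the Player methods used below are
    assumed interfaces for the low-order bookkeeping messages.
    """
    rules, alphas = [], []
    rounds = math.ceil(math.log(1.0 / eps))          # O(log 1/eps) rounds, constants elided
    for _ in range(rounds):
        # low-order communication: each player reports how much boosting weight it holds
        weights = [p.total_boosting_weight() for p in players]
        total = sum(weights)
        batch = []
        for p, wt in zip(players, weights):          # request ~O(d) examples overall,
            m = max(1, round(d * wt / total))        # more from players where h is doing badly
            batch.extend(p.sample_reweighted(m))
        h_t = weak_learn(batch)                      # weak rule on the reweighted sample
        err = sum(wt * p.weighted_error(h_t) for p, wt in zip(players, weights)) / total
        alpha_t = 0.5 * math.log((1.0 - err) / max(err, 1e-12))
        rules.append(h_t)
        alphas.append(alpha_t)
        for p in players:                            # everyone reweights locally, AdaBoost style
            p.reweight(h_t, alpha_t)
    return lambda x: 1 if sum(a * h(x) for a, h in zip(alphas, rules)) >= 0 else -1
```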

Agnostic learning (no perfect h) [Balcan-Hanneke] give robust halving alg that can be implemented in distributed setting. Based on analysis of a generalized active learning model. Algorithms especially suited to distributed setting.

Agnostic learning (no perfect h) [Balcan-Hanneke] give robust halving alg that can be implemented in distributed setting. Get error 2·OPT(C) + ε using a total of only O(k log|C| log(1/ε)) examples. Not computationally efficient, but says logarithmic dependence possible in principle.
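The robust halving algorithm itself is more delicate, but the source of the log|C| factor can be seen in the classic centralized, realizable halving algorithm that it generalizes, sketched here for intuition only (not the [Balcan-Hanneke] procedure):

```python
def halving_predictions(C, stream):
    """Classic halving algorithm over a finite class C (realizable case).

    Predict with the majority vote of all hypotheses still consistent with
    the data seen so far; whenever that vote is wrong, at least half of the
    surviving hypotheses are eliminated, so the number of mistakes is at most
    log2(|C|).  Hypotheses are callables x -> {-1, +1}; `stream` yields
    labeled pairs (x, y).
    """
    version_space = list(C)
    for x, y in stream:
        vote = sum(h(x) for h in version_space)
        yield 1 if vote >= 0 else -1                  # majority-vote prediction
        # keep only hypotheses that label this example correctly
        version_space = [h for h in version_space if h(x) == y]
```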

Can we do better for specific classes of functions?

Interesting class: parity functions Examples x ∈ {0,1}^d. f(x) = x · v_f mod 2, for unknown v_f. Interesting for k=2. Classic communication LB for determining if two subspaces intersect implies an Ω(d²) bits LB to output a good v. What if we allow rules that "look different"?

Interesting class: parity functions Rule h_S: if x is in the subspace spanned by the sample S, predict accordingly (using a consistent vector v_h); else say "??".

Interesting class: parity functions Examples x ∈ {0,1}^d. f(x) = x · v_f mod 2, for unknown v_f. Algorithm: –Each player i PAC-learns over D_i to get parity function g_i. Also R-U learns to get rule h_i. Sends g_i to the other player. –Uses rule: "if h_i predicts, use it; else use g_{3-i}." –Can one extend to k=3?
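A sketch of the "predict on the span, else ??" rule h_i from the slide above, using Gaussian elimination over GF(2): if a new x lies in the span of the examples a player has seen, linearity forces its label exactly; otherwise the rule answers "??". This illustrates only the local rule, not the full two-player exchange, and the class name is chosen here for illustration.

```python
import numpy as np

class ParitySubspaceRule:
    """The "predict on the span, else say ??" rule for parity functions.

    Labels satisfy f(x) = x . v_f mod 2 for an unknown v_f, so the label of
    any x in the GF(2)-span of observed examples is forced: if x is the XOR
    of observed x_j's, then f(x) is the XOR of their labels.  Outside the
    span we answer "??".
    """

    def __init__(self):
        self.basis = []          # (vector, label) pairs with distinct leading 1s

    def _reduce(self, x, y):
        """XOR x against the basis; return (residual, accumulated label)."""
        x = x.copy()
        for b, yb in self.basis:
            lead = int(np.argmax(b))         # position of b's leading 1
            if x[lead] == 1:
                x ^= b
                y ^= yb
        return x, y

    def observe(self, x, y):
        """Add one labeled example (x in {0,1}^d, y in {0,1})."""
        r, yr = self._reduce(np.asarray(x, dtype=np.uint8), int(y))
        if r.any():                          # new direction: extend the basis
            self.basis.append((r, yr))

    def predict(self, x):
        r, y = self._reduce(np.asarray(x, dtype=np.uint8), 0)
        return y if not r.any() else "??"    # forced label inside the span, else "??"
```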

Linear Separators Can one do better? Linear separators thru origin. (can assume pts on sphere) Say we have a near-uniform prob. distrib. D over S^d. VC-bound, margin bound, Perceptron mistake-bound all give O(d) examples needed to learn, so O(d) examples of communication using baselines (for constant k, ε).

Linear Separators Idea: Use margin-version of Perceptron alg [update until f(x)(w · x) ≥ 1 for all x] and run round-robin.

Linear Separators Idea: Use margin-version of Perceptron alg [update until f(x)(w · x) ≥ 1 for all x] and run round-robin. So long as examples x_i of player i and x_j of player j are reasonably orthogonal, updates of player j don't mess too much with data of player i. –Few updates ⇒ no damage. –Many updates ⇒ lots of progress!

Linear Separators Idea: Use margin-version of Perceptron alg [update until f(x)(w · x) ≥ 1 for all x] and run round-robin. If the overall distrib. D is near uniform [density bounded by c · uniform], then total communication (for constant k, ε) is O((d log d)^{1/2}) rather than O(d). Can we get similar savings for general distributions?
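A sketch of the round-robin idea with the margin-Perceptron update described above. The `Player.margin_violation(w)` method is an assumed interface returning a local example with f(x)(w·x) < 1, or None; `max_passes` is just a safety cap for the sketch.

```python
import numpy as np

def round_robin_margin_perceptron(players, d, max_passes=100):
    """Round-robin margin Perceptron across k players.

    The weight vector is passed from player to player; each player updates
    (w += y*x) on its own examples until every local example satisfies
    y*(w.x) >= 1, then hands w to the next player.  A full pass with no
    updates anywhere means w has margin >= 1 on everyone's data.
    """
    w = np.zeros(d)
    for _ in range(max_passes):
        any_update = False
        for p in players:                      # pass w around the ring
            while True:
                viol = p.margin_violation(w)   # local (x, y) with y*(w.x) < 1, or None
                if viol is None:
                    break
                x, y = viol
                w = w + y * x                  # margin-Perceptron update
                any_update = True
        if not any_update:                     # a clean round-robin pass: done
            return w
    return w                                   # best-effort w if passes run out
```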

Preserving Privacy of Data Natural also to consider privacy in this setting. Data elements could be patient records, customer records, click data. Want to preserve privacy of individuals involved. Compelling notion of differential privacy: if replace any one record with fake record, nobody else can tell. [Dwork, Nissim, …] Players hold samples S_1 ∼ D_1, S_2 ∼ D_2, …, S_k ∼ D_k.

Preserving Privacy of Data Natural also to consider privacy in this setting. Differential privacy: for all sequences of interactions σ, e^{-ε} ≤ Pr(A(S_i) = σ) / Pr(A(S_i') = σ) ≤ e^{ε} [probability over the randomness in A; i.e., the ratio lies between ≈ 1−ε and ≈ 1+ε].

Preserving Privacy of Data Natural also to consider privacy in this setting. A number of algorithms have been developed for differentially-private learning in the centralized setting. Can ask how to maintain privacy without increasing communication overhead.
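For intuition about the guarantee above, here is the standard Laplace mechanism for a single counting query, which is ε-differentially private because changing any one record changes the true count by at most 1. This is a textbook mechanism shown for illustration, not an algorithm from the COLT'12 paper.

```python
import numpy as np

def private_count(records, predicate, eps, rng=None):
    """Laplace mechanism for a counting query, eps-differentially private.

    Changing one record changes the true count by at most 1 (sensitivity 1),
    so adding Laplace(1/eps) noise changes the output distribution by at most
    a factor of e^eps, matching the ratio condition on the previous slide.
    """
    rng = np.random.default_rng() if rng is None else rng
    true_count = sum(1 for r in records if predicate(r))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / eps)
```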

Preserving Privacy of Data Another notion that is natural to consider in this setting: a kind of privacy for the data holder. View the distribution D_i as non-sensitive (statistical info about the population of people who are sick in city i). But the sample S_i ∼ D_i is sensitive (actual patients). Can we reveal no more about S_i than is inherent in D_i?

Preserving Privacy of Data Another notion that is natural to consider in this setting. Not guaranteed by differential privacy. E.g., consider the query "what fraction of S_i has HIV?" DP allows adding noise ∝ 1/|S_i|. Could be smaller than the sampling variation 1/|S_i|^{1/2}.
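A back-of-the-envelope comparison of the two scales, under assumed values |S_i| = 10,000 and ε = 1 (both chosen here purely for illustration).

```python
n = 10_000            # |S_i|: assumed sample size, for illustration only
eps = 1.0             # assumed privacy parameter

dp_noise = 1.0 / (eps * n)      # Laplace noise scale on the *fraction* query: 0.0001
sampling_dev = n ** -0.5        # typical sampling deviation of that fraction: 0.01

print(dp_noise, sampling_dev)   # DP noise is ~100x smaller than the sampling variation,
                                # so the noisy answer still tracks the particular sample
                                # S_i, not just the underlying distribution D_i
```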

Preserving Privacy of Data Another notion that is natural to consider in this setting. [Diagram: D_i → S_i → Protocol] Want to reveal no more info about S_i than is inherent in D_i.

Preserving Privacy of Data Another notion that is natural to consider in this setting. Draw an actual sample S_i ∼ D_i and a "ghost sample" S_i' ∼ D_i. Require: Pr_{S_i, S_i'}[ ∀σ, Pr(A(S_i) = σ) / Pr(A(S_i') = σ) ∈ 1 ± ε ] ≥ 1 − δ. Can get algorithms with this guarantee.

Conclusions As we move to large distributed datasets, communication issues become important. Rather than only asking "how much data is needed to learn well?", we should also ask "how much communication do we need?" Issues like privacy also become more critical. Quite a number of open questions remain.