Data Driven Resource Allocation for Distributed Learning


Data Driven Resource Allocation for Distributed Learning Travis Dick, Mu Li, Venkata Krishna Pillutla, Colin White, Maria-Florina Balcan, Alex Smola

Outline
- Problem
- Clustering-based Data Partitioning
- Summary of results
- Efficiently Clustering Data
  - LP-Rounding Approximation Algorithm
  - Beyond worst-case analysis: 𝑘-means++
  - Efficiency from subsampling
- Experimental results
  - Accuracy comparison
  - Scaling experiments

The problem
We want to distribute data to multiple machines for efficient machine learning. How should we partition the data?
[Diagram: the dataset is split into partitions 1 through 𝑛, one per machine.]
Common idea: randomly partition the data. Clean in both theory and practice, but suboptimal.

Our approach
Cluster the data and send one cluster to each machine. Accurate models tend to be locally simple but globally complex.
[Diagram: three clusters of the data, one per machine.]
Each machine learns a model for its local data. Additional clustering constraints:
- Balance: clusters should have roughly the same number of points.
- Replication: each point should be assigned to 𝑝 clusters.
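A minimal sketch of this pipeline, assuming plain scikit-learn 𝑘-means as a stand-in for our balanced clustering with replication, and a linear SVM as the local learner (both are illustrative choices, not the actual implementation):

```python
# Sketch: cluster-based data partitioning for distributed learning.
# Plain k-means stands in for balanced clustering with replication;
# any local learner could replace LinearSVC.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def partition_and_train(X, y, k):
    """Cluster the data, then train one local model per cluster."""
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    models = []
    for c in range(k):
        mask = km.labels_ == c  # points routed to machine c
        # assumes each cluster contains more than one class
        models.append(LinearSVC().fit(X[mask], y[mask]))
    return km, models

def predict(km, models, X_test):
    """Route each test point to the model of its nearest center."""
    owners = km.predict(X_test)
    preds = np.empty(len(X_test), dtype=int)
    for c, model in enumerate(models):
        mask = owners == c
        if mask.any():
            preds[mask] = model.predict(X_test[mask])
    return preds
```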

Summary of results
- Efficient algorithms with provable worst-case guarantees for balanced clustering with replication.
- For non-worst-case data, we show that common clustering algorithms produce high-quality balanced clusterings.
- We show how to efficiently partition a large dataset by clustering only a small sample.
- We empirically show that our technique significantly outperforms baseline methods and strongly scales.

How can we efficiently compute balanced clusterings with replication?

Balanced 𝑘-means with replication
Given a dataset 𝑆:
- Choose 𝑘 centers $c_1, \ldots, c_k \in S$.
- Assign each $x \in S$ to 𝑝 centers $f_1(x), \ldots, f_p(x) \in \{c_1, \ldots, c_k\}$.
𝑘-means cost: $\sum_{x \in S} \sum_{i=1}^{p} d(x, f_i(x))^2$.
Balance constraints: each center serves between $\ell|S|$ and $L|S|$ points.
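For concreteness, a small numpy helper (hypothetical, not from the paper) that evaluates the cost and balance constraints of a candidate solution; `assign[j]` holds the 𝑝 center indices for point 𝑗:

```python
import numpy as np

def replicated_kmeans_cost(X, centers, assign):
    """Sum over points of squared distance to each of its p centers."""
    return sum(np.sum((X[j] - centers[c]) ** 2)
               for j in range(len(X)) for c in assign[j])

def is_balanced(assign, k, n, lo, hi):
    """Every center serves between lo*n and hi*n points
    (lo and hi play the roles of l and L above)."""
    loads = np.bincount(np.concatenate(assign), minlength=k)
    return bool(np.all((lo * n <= loads) & (loads <= hi * n)))
```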

LP-Rounding Algorithm
The problem can be formulated as an integer program (NP-hard to solve exactly). The linear program relaxation can be solved efficiently, but gives "fractional" centers and "fractional" assignments. The rounding proceeds in three steps:
- Compute a coarse clustering of the data using a simple greedy procedure.
- Combine the fractional centers within each coarse cluster to form "whole" centers.
- Find optimal assignments by solving a min-cost flow problem (sketched below).
Theorem. The LP-rounding algorithm returns a constant-factor approximation for balanced 𝑘-means clustering with replication when $p \ge 2$, violating the upper capacities by at most $\frac{p+2}{2}$.*
* We have analogous results for 𝑘-median and 𝑘-center clustering as well.
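A sketch of the final assignment step, assuming networkx. For brevity it enforces only the upper capacity $Ln$ (the lower bound $\ell n$ can be added via the standard demand transformation); the integer scaling of costs and the helper name are assumptions of this sketch, and feasibility requires $k \ge p$ and $k \cdot Ln \ge pn$:

```python
# With the k centers fixed, optimal replicated, capacitated assignments
# reduce to a min-cost flow: source -> point (capacity p) -> center
# (capacity 1, cost = squared distance) -> sink (capacity L*n).
import networkx as nx
import numpy as np

def assign_by_flow(X, centers, p, cap):
    n, k = len(X), len(centers)
    G = nx.DiGraph()
    G.add_node('s', demand=-p * n)  # every point ships p units
    G.add_node('t', demand=p * n)
    for j in range(n):
        G.add_edge('s', ('pt', j), capacity=p, weight=0)
        for i in range(k):
            # scale to integer costs, as network simplex expects
            cost = int(1000 * np.sum((X[j] - centers[i]) ** 2))
            # capacity 1 => the p copies go to p distinct centers
            G.add_edge(('pt', j), ('ctr', i), capacity=1, weight=cost)
    for i in range(k):
        G.add_edge(('ctr', i), 't', capacity=cap, weight=0)  # cap = L*n
    flow = nx.min_cost_flow(G)
    return [[i for i in range(k) if flow[('pt', j)][('ctr', i)] > 0]
            for j in range(n)]
```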

Beyond worst-case: 𝑘-means++
For non-worst-case data, common algorithms also work well. A clustering instance satisfies $(\alpha, \varepsilon)$-approximation stability if every clustering $C$ with $\mathrm{cost}(C) \le (1+\alpha)\,OPT$ is $\varepsilon$-close to the optimal clustering. [1]
Theorem. 𝑘-means++ seeding with greedy pruning [5] outputs a nearly optimal solution for balanced clustering instances satisfying $(\alpha, \varepsilon)$-approximation stability.
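A minimal 𝑘-means++ seeding sketch (the standard Arthur–Vassilvitskii sampling; the greedy pruning step used in the theorem is omitted here):

```python
# k-means++ seeding: each new center is sampled with probability
# proportional to the squared distance from the nearest existing center.
import numpy as np

def kmeanspp_seed(X, k, rng=None):
    rng = rng or np.random.default_rng()
    centers = [X[rng.integers(len(X))]]          # first center: uniform
    d2 = np.sum((X - centers[0]) ** 2, axis=1)   # dist^2 to nearest center
    for _ in range(k - 1):
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
        d2 = np.minimum(d2, np.sum((X - centers[-1]) ** 2, axis=1))
    return np.array(centers)
```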

Efficiency from Subsampling
Goal: cluster a small sample of the data and use it to partition the entire dataset with good guarantees. Assign each new point to the same clusters as its nearest sampled neighbor. This automatically handles the balance constraints, and a new point costs about the same as its neighbor.
Theorem. If the cost on the sample is $\le r \cdot OPT$, then the cost of the extended clustering is $\le 4r \cdot OPT + O(r p D^2 \varepsilon + p r \alpha + r \beta)$, where $D$ is the diameter of the dataset and $\alpha, \beta$ measure how representative the sample is, going to zero as the sample size grows.
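A sketch of the nearest-neighbor extension, assuming scikit-learn; `sample_assign[i]` holds the 𝑝 cluster indices of the 𝑖-th sampled point:

```python
# Extend a clustering of a sample to the full dataset: each point
# inherits all p cluster memberships of its nearest sampled neighbor.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def extend_clustering(sample_X, sample_assign, full_X):
    nn = NearestNeighbors(n_neighbors=1).fit(sample_X)
    idx = nn.kneighbors(full_X, return_distance=False).ravel()
    return [sample_assign[i] for i in idx]
```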

How well does our method perform?

Experimental Setup
Goal: measure the difference in accuracy between partitioning methods.
- Partition the data using one of the partitioning methods.
- Learn a model on each machine.
- Report the test accuracy.
Repeat for multiple values of 𝑘; larger values of 𝑘 are more parallelizable.

Baseline Methods

Method                               Fast?   Balanced?   Locality?
Random                               ✔       ✔           ✗
Balanced Partition Tree (kd-tree)    ✔       ✔           1/2
Locality Sensitive Hashing           ✔       ✗           ✔
Our Method                           ✔       ✔           ✔

Experimental Evaluation
[Accuracy plots comparing the partitioning methods on five datasets: Synthetic, MNIST-8M, CIFAR10_IN3C, CIFAR10_IN4D, and CTR.]

Strong Scaling
For a fixed dataset, we evaluate the running time of our method on 8, 16, 32, or 64 machines and report the speedup over 8 machines. For all datasets, doubling the number of workers reduces the running time by a constant fraction, i.e., our method strongly scales.

Conclusion
- We propose balanced clustering with replication for data partitioning in distributed machine learning.
- LP-rounding algorithm with worst-case guarantees.
- Beyond worst-case analysis for 𝑘-means++.
- Efficient partitioning of large datasets by clustering only a sample.
- Empirical support for the utility of clustering-based partitioning.
- Empirically demonstrated strong scaling.

Thanks!

Extra Slides (You’ve gone too far!)

Capacitated 𝑘-means with replication
Choose 𝑘 centers and assign every point to 𝑝 centers so that points are close to their centers and each cluster is roughly the same size.
As an integer program, number the points 1 through 𝑛 and define:
- $y_i$ = 1 if point 𝑖 is a center, 0 otherwise.
- $x_{i,j}$ = 1 if point 𝑗 is assigned to point 𝑖.
Minimize $\sum_{i,j} x_{i,j}\, d(i,j)^2$ subject to:
- $\sum_i x_{i,j} = p$ for all points $j$ (𝑝 assignments)
- $\sum_i y_i = k$ (𝑘 centers)
- $\ell n\, y_i \le \sum_j x_{i,j} \le L n\, y_i$ for all points $i$ (balancedness)
- $x_{i,j},\, y_i \in \{0,1\}$ for all points $i, j$.
Linear program relaxation: $x_{i,j},\, y_i \in [0,1]$; the $x$'s are fractional assignments and the $y$'s are fractional centers.
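A toy-scale sketch of this relaxation with scipy.optimize.linprog; the variable ordering and function name are assumptions of this sketch, and with $n^2 + n$ variables it is illustrative only:

```python
# LP relaxation of capacitated k-means with replication.
# Variables: [x_{0,0}, ..., x_{n-1,n-1}, y_0, ..., y_{n-1}];
# d2 is the n-by-n matrix of squared distances d(i, j)^2.
import numpy as np
from scipy.optimize import linprog

def solve_lp_relaxation(d2, k, p, lo, hi):
    n = d2.shape[0]
    nx = n * n                                     # number of x variables
    c = np.concatenate([d2.ravel(), np.zeros(n)])  # cost only on the x's
    A_eq, b_eq = [], []
    for j in range(n):                             # sum_i x_{i,j} = p
        row = np.zeros(nx + n)
        row[j::n] = 1                              # hits x_{i,j} for every i
        A_eq.append(row); b_eq.append(p)
    row = np.zeros(nx + n); row[nx:] = 1           # sum_i y_i = k
    A_eq.append(row); b_eq.append(k)
    A_ub, b_ub = [], []
    for i in range(n):
        xi = np.zeros(nx + n); xi[i * n:(i + 1) * n] = 1  # sum_j x_{i,j}
        low = -xi.copy(); low[nx + i] = lo * n     # lo*n*y_i <= sum_j x_{i,j}
        high = xi.copy(); high[nx + i] = -hi * n   # sum_j x_{i,j} <= hi*n*y_i
        A_ub += [low, high]; b_ub += [0, 0]
    res = linprog(c, A_ub, b_ub, A_eq, b_eq, bounds=(0, 1))
    x = res.x[:nx].reshape(n, n)                   # fractional assignments
    y = res.x[nx:]                                 # fractional centers
    return x, y
```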

LP-Rounding algorithm
- Solve the LP to get fractional 𝑦's and 𝑥's.
- Compute a very coarse clustering of the points using a greedy procedure.
- Within each coarse cluster, round the 𝑦's to 0 or 1.
- Find the optimal integral 𝑥's by solving a min-cost flow problem.


LP-Rounding Algorithm
Theorem. The LP-rounding algorithm returns a constant-factor approximation for capacitated 𝑘-means clustering with replication when $p \ge 2$, violating the upper capacities by at most $\frac{p+2}{2}$.*
* We have analogous results for 𝑘-median and 𝑘-center clustering as well.