Scaling SGD to Big Data & Huge Models

Presentation transcript:

Scaling SGD to Big Data & Huge Models. Alex Beutel. Based on work done with Abhimanu Kumar, Vagelis Papalexakis, Partha Talukdar, Qirong Ho, Christos Faloutsos, and Eric Xing.

Big Learning Challenges. Data: 1 billion users on Facebook; 300 million photos uploaded to Facebook per day; 400 million tweets per day. Tasks: Collaborative Filtering (predict movie preferences); Tensor Decomposition (find communities in temporal graphs); Topic Modeling (what are the topics of webpages, tweets, or status updates?); Dictionary Learning (remove noise or missing pixels from images).

Big Data & Huge Model Challenge. 2 billion tweets covering 300,000 words, broken into 1000 topics: more than 2 trillion parameters to learn and over 7 terabytes of model. Context: 400 million tweets per day. Topic Modeling: what are the topics of webpages, tweets, or status updates?

Outline. Background. Optimization: Partitioning; Constraints & Projections. System Design: General algorithm; How to use Hadoop; Distributed normalization; “Always-On SGD” (dealing with stragglers). Experiments. Future questions.

Background

Stochastic Gradient Descent (SGD)


SGD for Matrix Factorization: X ≈ U V, where X is the Users × Movies matrix and the factors U and V are linked through Genres.

SGD for Matrix Factorization: X ≈ U V (figure: highlighted entries of X are marked “Independent!”).
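The slides give no formulas here, so the following is only a minimal sketch of what SGD for matrix factorization can look like, assuming a squared-error loss with L2 regularization. The function names, step size, and regularization weight are illustrative choices and are not taken from the talk. Each update for an observed entry (i, j) touches only row i of U and row j of V, which is why entries that share no row and no column can be updated independently.

```python
import random

def sgd_matrix_factorization(ratings, n_users, n_items, rank=2,
                             step=0.02, reg=0.01, epochs=500, seed=0):
    """Fit X[i][j] ~ dot(U[i], V[j]) on observed (i, j, x) triples with SGD."""
    rng = random.Random(seed)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_users)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_items)]
    points = list(ratings)
    for _ in range(epochs):
        rng.shuffle(points)  # visit the observed entries in random order
        for i, j, x in points:
            err = x - sum(u * v for u, v in zip(U[i], V[j]))
            for k in range(rank):
                u_ik, v_jk = U[i][k], V[j][k]
                # Only row i of U and row j of V are modified by this point.
                U[i][k] += step * (err * v_jk - reg * u_ik)
                V[j][k] += step * (err * u_ik - reg * v_jk)
    return U, V

def squared_error(ratings, U, V):
    """Sum of squared residuals over the observed entries."""
    return sum((x - sum(u * v for u, v in zip(U[i], V[j]))) ** 2
               for i, j, x in ratings)
```

A call such as `sgd_matrix_factorization([(0, 0, 5.0), (1, 1, 4.0)], 2, 2)` processes two points that share neither a user nor a movie, so their updates never read or write the same parameters.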

The Rise of SGD. Hogwild! (Niu et al, 2011): noticed independence; if the matrix is sparse, there will be little contention; ignore locks. DSGD (Gemulla et al, 2011): broke the matrix into blocks.

DSGD for Matrix Factorization (Gemulla, 2011) Independent Blocks

DSGD for Matrix Factorization (Gemulla, 2011). Partition your data & model into d × d blocks. This results in d = 3 strata. Process strata sequentially; process the blocks in each stratum in parallel.
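One common way to build such strata is a cyclic shift of the column-block index, sketched below. The function name `dsgd_strata` and the 0-based block indices are assumptions for illustration; the slide does not say which assignment of blocks to strata is used.

```python
def dsgd_strata(d):
    """Return d strata for a d x d block grid.

    Stratum s holds the blocks (b, (b + s) % d) for b = 0 .. d-1, so no two
    blocks in a stratum share a row block or a column block.
    """
    return [[(b, (b + s) % d) for b in range(d)] for s in range(d)]

if __name__ == "__main__":
    for s, stratum in enumerate(dsgd_strata(3)):
        print("stratum", s, stratum)
```

Because the blocks of one stratum touch disjoint rows of U and disjoint rows of V, they can be handed to d workers at once, and the d strata together cover all d × d blocks.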

Other Big Learning Platforms. GraphLab (Low et al, 2010): find independence in graphs. PSGD (Zinkevich et al, 2010): average independent runs on convex problems. Parameter Servers (Li et al, 2014; Ho et al, 2014): distributed cache of parameters; allow a little “staleness”.

Tensor Decomposition

What is a tensor? Tensors are used for structured data with more than 2 dimensions. Think of one as a 3D matrix. For example, “Derek Jeter plays baseball” is an entry indexed by Subject, Verb, and Object.

Tensor Decomposition: X ≈ factors U, V, W (figure: tensor with modes Subject, Verb, Object; example entry “Derek Jeter plays baseball”).

Tensor Decomposition: X ≈ factors U, V, W (figure: some blocks of X are marked Independent, others Not Independent).

Tensor Decomposition

For d = 3 blocks per stratum, we require d² = 9 strata.
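The count follows from the block grid: a 3-mode tensor split into d parts per mode has d³ blocks, and each stratum holds d mutually independent blocks, so d³ / d = d² strata are needed. A minimal sketch of one possible construction, assuming cyclic shifts on the second and third modes (the name `tensor_strata` and the shift scheme are illustrative, not confirmed by the slides):

```python
from itertools import product

def tensor_strata(d):
    """Return d*d strata covering all d**3 blocks of a 3-mode tensor.

    The stratum for shifts (s1, s2) holds blocks (b, (b+s1) % d, (b+s2) % d),
    so blocks within a stratum differ in every mode.
    """
    return [[(b, (b + s1) % d, (b + s2) % d) for b in range(d)]
            for s1, s2 in product(range(d), repeat=2)]

if __name__ == "__main__":
    strata = tensor_strata(3)
    print(len(strata), "strata of", len(strata[0]), "blocks each")
```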

Coupled Matrix + Tensor Decomposition (figure: tensor X with modes Subject, Verb, Object, coupled with a matrix Y that adds a Document dimension; X and Y are approximated by factor matrices U, V, W, and A).

Coupled Matrix + Tensor Decomposition

Constraints & Projections

Example: Topic Modeling (figure: Documents, Words, and Topics).

Constraints. Sometimes we want to restrict the response: non-negative; sparsity; simplex (so vectors become probabilities); keep inside the unit ball.

How to enforce? Projections. Example: non-negative.

More projections: sparsity (soft thresholding); simplex; unit ball.
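The projection formulas did not survive in this transcript, so the sketch below uses the standard textbook forms of the four projections named above. It is an assumption that these match the formulas on the original slides. The function names and the threshold parameter `lam` are illustrative.

```python
import math

def project_nonnegative(v):
    """Clip negative coordinates to zero."""
    return [max(0.0, x) for x in v]

def soft_threshold(v, lam):
    """Shrink each coordinate toward zero by lam (promotes sparsity)."""
    out = []
    for x in v:
        shrunk = max(abs(x) - lam, 0.0)
        out.append(0.0 if shrunk == 0.0 else math.copysign(shrunk, x))
    return out

def project_unit_ball(v):
    """Rescale v onto the unit ball if its Euclidean norm exceeds 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return list(v) if norm <= 1.0 else [x / norm for x in v]

def project_simplex(v):
    """Euclidean projection onto {w : w >= 0, sum(w) = 1} (sort-based)."""
    u = sorted(v, reverse=True)
    cumsum, theta = 0.0, 0.0
    for k, u_k in enumerate(u, start=1):
        cumsum += u_k
        t = (cumsum - 1.0) / k
        if u_k - t > 0:       # coordinate k still positive after the shift
            theta = t
    return [max(x - theta, 0.0) for x in v]
```

In projected SGD, one of these functions would be applied to the affected parameter vector right after each gradient step.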

Sparse Non-Negative Tensor Factorization. Constraints: sparse encoding; non-negativity (more interpretable results).

Dictionary Learning. Learn a dictionary of concepts and a sparse reconstruction. Useful for fixing noise and missing pixels in images. Constraints: sparse encoding; within the unit ball.

Mixed Membership Network Decomposition. Used for modeling communities in graphs (e.g. a social network). Constraints: simplex; non-negative.

Proof Sketch of Convergence. Regenerative process: each point is used once per epoch. Projections are not too big and don’t “wander off” (Lipschitz continuous). Step sizes are bounded. (Equation figure: the update is annotated with the terms Normal Gradient Descent Update, Noise from SGD, and Projection / SGD Constraint error.)

System design

High level algorithm (strata are processed in order: Stratum 1, Stratum 2, Stratum 3, …):
for epoch e = 1 … T do
    for subepoch s = 1 … d² do
        take the set of blocks in stratum s
        for block b = 1 … d in parallel do
            run SGD on all points in block b
        end for
    end for
end for
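The loop above can be sketched as a small driver. This is an illustration under assumptions, not the authors' implementation: a thread pool stands in for the d parallel workers, strata are built with cyclic shifts, and `run_sgd_on_block` is a hypothetical callback that runs SGD on the points of one block.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def run_stratified_sgd(d, epochs, run_sgd_on_block):
    """Process d*d strata per epoch; blocks within a stratum run in parallel."""
    with ThreadPoolExecutor(max_workers=d) as pool:
        for _ in range(epochs):
            for s1, s2 in product(range(d), repeat=2):   # d*d subepochs
                stratum = [(b, (b + s1) % d, (b + s2) % d) for b in range(d)]
                # list() waits for every block to finish, which acts as the
                # barrier between consecutive subepochs.
                list(pool.map(run_sgd_on_block, stratum))

if __name__ == "__main__":
    run_stratified_sgd(3, 1, lambda block: print("SGD on block", block))
```

The barrier at the end of each subepoch is what makes stragglers costly, since every worker waits for the slowest one before the next stratum starts.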

Bad Hadoop Algorithm, Subepoch 1: mappers feed three reducers; each reducer runs SGD on its block and updates one parameter triple: (U2, V1, W3), (U3, V2, W1), (U1, V3, W2).

Bad Hadoop Algorithm, Subepoch 2: the same job is run again; the reducers now update (U2, V1, W2), (U3, V2, W3), (U1, V3, W1).

Hadoop Challenges. MapReduce is typically very bad for iterative algorithms: T × d² iterations; sizable overhead per Hadoop job; little flexibility.

High Level Algorithm (figure, shown in three animation steps: parameter blocks of U, V, and W).

Hadoop Algorithm. Mappers process points: map each point to its block, with the necessary info to order. Partition & Sort. Reducers: each reducer runs SGD on its blocks and updates its parameters, (U1, V1, W1), (U2, V2, W2), (U3, V3, W3), … Parameters pass through HDFS.
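One way a mapper could attach the “necessary info to order” is a composite key of (reducer, subepoch), so that Hadoop's partition and sort phase delivers each reducer's points grouped by subepoch. The sketch below is a guess at that idea in plain Python rather than Hadoop code. The key layout, the helper names, and the rule that the first-mode block selects the reducer are all assumptions.

```python
def block_index(i, size, d):
    """Index (0 .. d-1) of the block that coordinate i falls into."""
    return i * d // size

def map_point(point, dims, d):
    """Emit ((reducer, subepoch), point) for one nonzero tensor entry."""
    (i, j, k), _value = point
    bi, bj, bk = (block_index(idx, size, d)
                  for idx, size in zip((i, j, k), dims))
    reducer = bi                                    # partition on first mode
    subepoch = ((bj - bi) % d) * d + ((bk - bi) % d)  # which stratum, 0..d*d-1
    return (reducer, subepoch), point

if __name__ == "__main__":
    points = [((0, 2, 5), 1.0), ((5, 5, 5), 2.0), ((1, 0, 3), 0.5)]
    for key, point in sorted(map_point(p, (6, 6, 6), 3) for p in points):
        print(key, point)
```

With such a key, a single MapReduce job can cover all d² subepochs of an epoch, because each reducer simply walks through its sorted input.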

System Summary. Limit storage and transfer of data and model. Stock Hadoop can be used, with HDFS for communication. Hadoop makes the implementation highly portable. Alternatively, it could be implemented on top of MPI or even a parameter server.

Distributed Normalization (figure: Documents, Words, and Topics; per-machine blocks π1, π2, π3 and β1, β2, β3).

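Distributed Normalization. σ(b) is a k-dimensional vector, summing the terms of βb. Transfer σ(b) to all machines. Each machine calculates σ, then normalizes. (Figure: each machine holds πb and βb and receives σ(1), σ(2), and σ(3); the formulas for σ and for the normalization step are not preserved in this transcript.)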
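Since the formulas are missing, the sketch below assumes the natural reading: σ is the elementwise sum of the per-machine vectors σ(b), and each machine divides topic t of its block βb by σ[t], so every topic sums to 1 across all words. The function names and the list-of-rows layout (one row per topic) are illustrative.

```python
def local_sums(beta_b):
    """sigma(b): for each of the k topics, sum the terms of the local block."""
    return [sum(row) for row in beta_b]

def combine(sigmas):
    """sigma: elementwise sum of the sigma(b) vectors from all machines."""
    return [sum(parts) for parts in zip(*sigmas)]

def normalize_block(beta_b, sigma):
    """Divide each topic row of the local block by the global topic total."""
    return [[x / s for x in row] for row, s in zip(beta_b, sigma)]

if __name__ == "__main__":
    blocks = [[[1.0, 1.0], [2.0, 0.0]],   # machine 1: k=2 topics, 2 words
              [[2.0], [2.0]]]             # machine 2: k=2 topics, 1 word
    sigma = combine([local_sums(b) for b in blocks])
    print([normalize_block(b, sigma) for b in blocks])
```

Only the k-dimensional vectors σ(b) cross the network, which is far cheaper than moving the blocks βb themselves.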

Barriers & Stragglers (figure: the Hadoop Algorithm pipeline of mappers, Partition & Sort, reducers, and HDFS). Reducers that finish early are wasting time waiting!

Solution: “Always-On SGD”. For each reducer: (1) Run SGD on all points in the current block Z. (2) Check if other reducers are ready to sync. (3) If not ready to sync, do not wait: shuffle the points in Z, decrease the step size, run SGD on the points in Z again, and return to step 2. (4) Once ready, sync parameters and get a new block Z.
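The reducer loop described above can be sketched as follows. This is a single-process illustration under assumptions: `sgd_step` and `ready_to_sync` are hypothetical callbacks, and the multiplicative `decay` is one arbitrary way to decrease the step size; the slides do not specify a schedule.

```python
import random

def always_on_sgd(block, sgd_step, ready_to_sync, step_size,
                  decay=0.9, seed=0):
    """Keep running SGD passes over one block until peers are ready to sync.

    Returns the number of passes made (the first pass plus any extra ones).
    """
    rng = random.Random(seed)
    points = list(block)
    passes = 0
    while True:
        for p in points:
            sgd_step(p, step_size)
        passes += 1
        if ready_to_sync():
            return passes          # caller now syncs and fetches a new block
        rng.shuffle(points)        # reuse old points in a fresh order
        step_size *= decay         # smaller steps for the extra updates

if __name__ == "__main__":
    state = {"w": 0.0, "checks": 0}

    def step(x, lr):               # toy model: estimate the mean of the data
        state["w"] += lr * (x - state["w"])

    def ready():                   # pretend peers are ready on the 3rd check
        state["checks"] += 1
        return state["checks"] >= 3

    print(always_on_sgd([1.0, 2.0, 3.0], step, ready, step_size=0.5), state["w"])
```

A fast reducer therefore spends its idle time on extra updates instead of blocking at the barrier.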

“Always-On SGD” (figure: the Hadoop Algorithm pipeline; reducers that would otherwise wait run SGD on old points again!).

Proof Sketch. Martingale Difference Sequence (MDS): at the beginning of each epoch, the expected number of times each point will be processed is equal. Properties of SGD and MDS can be used to show that variance decreases as more points are used. Extra updates are valuable.

“Always-On SGD” (timeline figure for Reducers 1 to 4; legend: Read Parameters from HDFS, First SGD pass of block Z, Extra SGD Updates, Write Parameters to HDFS). Also noted: use on parameter server, Gibbs sampling.

Experiments

FlexiFaCT (Tensor Decomposition) Convergence

FlexiFaCT (Tensor Decomposition) Scalability in Data Size

FlexiFaCT (Tensor Decomposition): Scalability in Tensor Dimension. Rank = 50. Handles up to 2 billion parameters!

FlexiFaCT (Tensor Decomposition): Scalability in Rank of Decomposition. Handles up to 4 billion parameters!

Fugue (Using “Always-On SGD”) Dictionary Learning: Convergence

Fugue (Using “Always-On SGD”) Community Detection: Convergence

Fugue (Using “Always-On SGD”) Topic Modeling: Convergence

Fugue (Using “Always-On SGD”) Topic Modeling: Scalability in Data Size. GraphLab cannot spill to disk.

Fugue (Using “Always-On SGD”) Topic Modeling: Scalability in Rank

Fugue (Using “Always-On SGD”) Topic Modeling: Scalability over Machines

Fugue (Using “Always-On SGD”) Topic Modeling: Number of Machines

Fugue (Using “Always-On SGD”)

Looking forward

Future Questions. Do “extra updates” work on other techniques, e.g. Gibbs sampling? Other iterative algorithms? What other problems can be partitioned well (model & data)? Can we better choose certain data for extra updates? How can we store large models on disk for I/O-efficient updates?

Key Points. Flexible method for tensors & ML models. Partition both data and model together for efficiency and scalability. When waiting for slower machines, run extra updates on old data again. Algorithmic & systems challenges in scaling ML can be addressed through statistical innovation.

Questions? Alex Beutel, abeutel@cs.cmu.edu, http://alexbeutel.com. Source code available at http://beu.tl/flexifact