Distributed Iterative Training
Kevin Gimpel, Shay Cohen, Severin Hacker, Noah A. Smith

Outline
– The Problem
– Distributed Architecture
– Experiments and Hadoop Issues

Iterative Training
Many problems in NLP and machine learning require iterating over large training sets many times:
– Training log-linear models (logistic regression, conditional random fields)
– Unsupervised or semi-supervised learning with EM (word alignment in MT, grammar induction)
– Minimum Error-Rate Training in MT
– *Online learning (MIRA, perceptron, stochastic gradient descent)
All of the above except * can be easily parallelized:
– Compute statistics on sections of the data independently
– Aggregate them
– Update parameters using statistics of the full set of data
– Repeat until a stopping criterion is met (see the sketch below)
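[Editor's sketch, not from the original slides.] A minimal single-process illustration of this compute/aggregate/update pattern, using batch logistic regression (one of the models listed above) as a stand-in; the type names, the learning rate eta, and the fixed iteration count are illustrative assumptions. In the distributed setting each shard's statistics would come from a separate map task; here the shards are simply looped over.

    // Sketch: each data shard contributes partial statistics (here, a partial
    // gradient), the partials are aggregated, and the parameters are updated;
    // this repeats for a fixed number of iterations in place of a real
    // stopping criterion.
    #include <cmath>
    #include <cstddef>
    #include <vector>

    struct Example { std::vector<double> x; int y; };  // y in {0, 1}
    using Shard  = std::vector<Example>;
    using Params = std::vector<double>;

    // "Map" step: gradient contribution of one shard, independent of the others.
    Params shard_gradient(const Shard& shard, const Params& w) {
      Params g(w.size(), 0.0);
      for (const Example& e : shard) {
        double dot = 0.0;
        for (std::size_t j = 0; j < w.size(); ++j) dot += w[j] * e.x[j];
        const double p = 1.0 / (1.0 + std::exp(-dot));   // predicted P(y = 1)
        for (std::size_t j = 0; j < w.size(); ++j) g[j] += (e.y - p) * e.x[j];
      }
      return g;
    }

    Params train(const std::vector<Shard>& shards, Params w,
                 int iterations, double eta) {
      for (int it = 0; it < iterations; ++it) {
        Params total(w.size(), 0.0);
        for (const Shard& s : shards) {                  // one map task per shard
          const Params g = shard_gradient(s, w);
          for (std::size_t j = 0; j < w.size(); ++j) total[j] += g[j];   // aggregate
        }
        for (std::size_t j = 0; j < w.size(); ++j) w[j] += eta * total[j];  // update
      }
      return w;
    }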

Dependency Grammar Induction
– Given sentences of natural language text, infer (dependency) parse trees
– State-of-the-art results obtained using only a few thousand sentences of length ≤ 10 tokens (Smith and Eisner, 2006)
– This talk: scaling up to more and longer sentences using Hadoop!

Dependency Grammar Induction
Training:
– Input is a set of sentences (actually, POS tag sequences) and a grammar with initial parameter values
– Run an iterative optimization algorithm (EM, L-BFGS, etc.) that changes the parameter values on each iteration
– Output is a learned set of parameter values
Testing:
– Use the grammar with learned parameters to parse a small set of test sentences
– Evaluate by computing the percentage of predicted edges that match a human annotator (see the sketch below)
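[Editor's sketch, not code from the talk.] As a concrete illustration of the evaluation step, the percentage of predicted edges matching the human annotation can be computed by comparing head indices; representing each parse as a vector of head positions is an assumption made here.

    #include <cstddef>
    #include <vector>

    // A dependency parse of an n-word sentence as a vector of head indices:
    // heads[i] is the position of word i's parent, with 0 reserved for the root.
    using Heads = std::vector<int>;

    // Directed attachment accuracy: the fraction of predicted edges that match
    // the annotated (gold) edges, pooled over all test sentences.
    double attachment_accuracy(const std::vector<Heads>& predicted,
                               const std::vector<Heads>& gold) {
      long correct = 0, total = 0;
      for (std::size_t s = 0; s < predicted.size(); ++s) {
        for (std::size_t i = 0; i < predicted[s].size(); ++i) {
          if (predicted[s][i] == gold[s][i]) ++correct;
          ++total;
        }
      }
      return total == 0 ? 0.0 : static_cast<double>(correct) / total;
    }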

Outline
– The Problem
– Distributed Architecture
– Experiments and Hadoop Issues

MapReduce for Grammar Induction
MapReduce was designed for:
– Large amounts of data distributed across many disks
– Simple data processing
We have:
– (Relatively) small amounts of data
– Expensive processing and high memory requirements

MapReduce for Grammar Induction
Algorithms require iterations for convergence:
– Each iteration requires a full sweep over all training data
– Computational bottleneck is computing expected counts for EM on each iteration (the gradient for L-BFGS)
Our approach: run one MapReduce job for each iteration
– Map: compute expected counts (gradient)
– Reduce: aggregate
– Offline: renormalize (EM) or modify parameter values (L-BFGS); a sketch of the renormalization follows below
Note: renormalization could be done in reduce tasks for EM with correct partition functions, but using L-BFGS in multiple reduce tasks is trickier
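[Editor's sketch; the data layout is an assumption, not the talk's actual representation.] The offline renormalization step for EM: aggregated expected counts are grouped by conditioning context, and each conditional distribution is rescaled to sum to one.

    #include <map>
    #include <string>

    // Aggregated expected counts grouped by conditioning context; for example,
    // counts["dep(NN, right)"]["CD"] holds the expected count behind
    // p_dep(CD | NN, right).
    using Table = std::map<std::string, std::map<std::string, double>>;

    // EM re-estimation: normalize each conditional distribution so that its
    // outcomes sum to one, yielding the new parameter values.
    Table renormalize(const Table& counts) {
      Table probs;
      for (const auto& ctx : counts) {
        double z = 0.0;
        for (const auto& outcome : ctx.second) z += outcome.second;
        if (z == 0.0) continue;                 // skip contexts with no mass
        for (const auto& outcome : ctx.second)
          probs[ctx.first][outcome.first] = outcome.second / z;
      }
      return probs;
    }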

MapReduce Implementation
[Diagram of one iteration:]
– Input sentences (POS tag sequences), e.g. [NNP,NNP,VBZ,NNP], [DT,JJ,NN,MD,VB,JJ,NNP,CD], [DT,NN,NN,VBZ,RB,VBN,VBN], …
– Map: compute expected counts, emitted as (parameter, count) pairs, e.g. p_root(NN) 0.345, p_root(NN) 1.875, p_dep(CD | NN, right) 0.175, p_dep(CD | NN, right) 0.025, p_dep(DT | NN, right) 0.065, …
– Reduce: aggregate expected counts, e.g. p_root(NN) 2.220, p_dep(CD | NN, right) 0.200, p_dep(DT | NN, right) 0.065, …
– Server: (1) normalize expected counts to get new parameter values; (2) start a new MapReduce job, placing the new parameter values (p_root(NN), p_dep(CD | NN, right), p_dep(DT | NN, right), …) on the distributed cache

Running Experiments
We use streaming for all experiments, with 2 C++ programs: server and map (reduce is a simple summer; a sketch follows below).

    > cd /home/kgimpel/grammar_induction
    > hod allocate -d /home/kgimpel/grammar_induction -n 25
    > ./dep_induction_server \
        input_file=/user/kgimpel/data/train20-20parts \
        aux_file=aux.train20 output_file=model.train20 \
        hod_config=/home/kgimpel/grammar_induction \
        num_reduce_tasks=5 1> stdout 2> stderr

– dep_induction_server runs a MapReduce job on each iteration
– Input is split into pieces for the map tasks (the dataset is too small for the default Hadoop splitter)
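[Editor's sketch; the actual reduce program is not shown on the slides.] A minimal "simple summer" for Hadoop streaming might look like the following: it reads tab-separated (parameter, expected count) lines from stdin, which streaming delivers grouped and sorted by key, and writes one summed line per parameter.

    #include <iostream>
    #include <sstream>
    #include <string>

    int main() {
      std::string line, key, current;
      double value = 0.0, sum = 0.0;
      bool have_key = false;
      while (std::getline(std::cin, line)) {
        std::istringstream in(line);
        if (!std::getline(in, key, '\t') || !(in >> value)) continue;  // skip malformed lines
        if (have_key && key != current) {            // key boundary: emit previous sum
          std::cout << current << '\t' << sum << '\n';
          sum = 0.0;
        }
        current = key;
        have_key = true;
        sum += value;
      }
      if (have_key) std::cout << current << '\t' << sum << '\n';  // flush the last key
      return 0;
    }

Because summation is associative, the same program could also serve as a combiner to cut down the data shuffled to the reduce tasks.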

Outline
– The Problem
– Distributed Architecture
– Experiments and Hadoop Issues

Speed-up with Hadoop
– 38,576 sentences, ≤ 40 words / sentence
– 40 nodes, 5 reduce tasks
– Average iteration time reduced from 2039 s to 115 s (roughly a 17x speed-up)
– Total time reduced from 3400 minutes to 200 minutes

Hadoop Issues
1. Overhead of running a single MapReduce job
2. Stragglers in the map phase

Typical Iteration (40 nodes, 38,576 sentences):

    23:17:05 : map 0% reduce 0%
    23:17:12 : map 3% reduce 0%
    23:17:13 : map 26% reduce 0%
    23:17:14 : map 49% reduce 0%
    23:17:15 : map 66% reduce 0%
    23:17:16 : map 72% reduce 0%
    23:17:17 : map 97% reduce 0%
    23:17:18 : map 100% reduce 0%
    23:18:00 : map 100% reduce 1%
    23:18:15 : map 100% reduce 2%
    23:18:18 : map 100% reduce 4%
    23:18:20 : map 100% reduce 15%
    23:18:27 : map 100% reduce 17%
    23:18:28 : map 100% reduce 18%
    23:18:30 : map 100% reduce 23%
    23:18:32 : map 100% reduce 100%

– Consistent 40-second delay between the map and reduce phases
– 115 s per iteration total, 40+ s of overhead: roughly a third of execution time is overhead!
– When we're running 100 iterations per experiment, 40 seconds per iteration really adds up!

Same typical iteration (log above):
– 5 reduce tasks used
– The reduce phase is simply aggregation of values for 2600 parameters
– Why does reduce take so long?

Histogram of Iteration Times
[Histogram not reproduced in the transcript: mean ≈ 115 s, plus a tail of much slower iterations]
What's going on here?

Typical Iteration (see the log above) vs. Slow Iteration:

    23:20:27 : map 0% reduce 0%
    23:20:34 : map 5% reduce 0%
    23:20:35 : map 20% reduce 0%
    23:20:36 : map 41% reduce 0%
    23:20:37 : map 56% reduce 0%
    23:20:38 : map 74% reduce 0%
    23:20:39 : map 95% reduce 0%
    23:20:40 : map 97% reduce 0%
    23:21:32 : map 97% reduce 1%
    23:21:37 : map 97% reduce 2%
    23:21:42 : map 97% reduce 12%
    23:21:43 : map 97% reduce 15%
    23:21:47 : map 97% reduce 19%
    23:21:50 : map 97% reduce 21%
    23:21:52 : map 97% reduce 26%
    23:21:57 : map 97% reduce 31%
    23:21:58 : map 97% reduce 32%
    23:23:46 : map 100% reduce 32%
    23:24:54 : map 100% reduce 46%
    23:24:55 : map 100% reduce 86%
    23:24:56 : map 100% reduce 100%

– 3 minutes waiting for the last map tasks to complete (map stuck at 97% from 23:20:40 to 23:23:46)

Suggestions? (Doesn't Hadoop replicate map tasks to avoid this?)

Questions?