Distributed Iterative Training
Kevin Gimpel, Shay Cohen, Severin Hacker, Noah A. Smith

Outline
– The Problem
– Distributed Architecture
– Experiments and Hadoop Issues

Iterative Training
Many problems in NLP and machine learning require iterating over large training sets many times:
– Training log-linear models (logistic regression, conditional random fields)
– Unsupervised or semi-supervised learning with EM (word alignment in MT, grammar induction)
– Minimum Error-Rate Training in MT
– *Online learning (MIRA, perceptron, stochastic gradient descent)
All of the above except * can be easily parallelized:
– Compute statistics on sections of the data independently
– Aggregate them
– Update parameters using statistics of the full set of data
– Repeat until a stopping criterion is met (see the sketch below)
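[Editor's sketch, not from the original slides.] A minimal single-process illustration of this compute/aggregate/update pattern, using batch logistic regression (one of the models listed above) as a stand-in; the type names, the learning rate eta, and the fixed iteration count are illustrative assumptions. In the distributed setting each shard's statistics would come from a separate map task; here the shards are simply looped over.

    // Sketch: each data shard contributes partial statistics (here, a partial
    // gradient), the partials are aggregated, and the parameters are updated;
    // this repeats for a fixed number of iterations in place of a real
    // stopping criterion.
    #include <cmath>
    #include <cstddef>
    #include <vector>

    struct Example { std::vector<double> x; int y; };  // y in {0, 1}
    using Shard  = std::vector<Example>;
    using Params = std::vector<double>;

    // "Map" step: gradient contribution of one shard, independent of the others.
    Params shard_gradient(const Shard& shard, const Params& w) {
      Params g(w.size(), 0.0);
      for (const Example& e : shard) {
        double dot = 0.0;
        for (std::size_t j = 0; j < w.size(); ++j) dot += w[j] * e.x[j];
        const double p = 1.0 / (1.0 + std::exp(-dot));   // predicted P(y = 1)
        for (std::size_t j = 0; j < w.size(); ++j) g[j] += (e.y - p) * e.x[j];
      }
      return g;
    }

    Params train(const std::vector<Shard>& shards, Params w,
                 int iterations, double eta) {
      for (int it = 0; it < iterations; ++it) {
        Params total(w.size(), 0.0);
        for (const Shard& s : shards) {                  // one map task per shard
          const Params g = shard_gradient(s, w);
          for (std::size_t j = 0; j < w.size(); ++j) total[j] += g[j];   // aggregate
        }
        for (std::size_t j = 0; j < w.size(); ++j) w[j] += eta * total[j];  // update
      }
      return w;
    }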

Dependency Grammar Induction
– Given sentences of natural language text, infer (dependency) parse trees
– State-of-the-art results obtained using only a few thousand sentences of length ≤ 10 tokens (Smith and Eisner, 2006)
– This talk: scaling up to more and longer sentences using Hadoop!

Dependency Grammar Induction
Training:
– Input is a set of sentences (actually, POS tag sequences) and a grammar with initial parameter values
– Run an iterative optimization algorithm (EM, L-BFGS, etc.) that changes the parameter values on each iteration
– Output is a learned set of parameter values
Testing:
– Use the grammar with learned parameters to parse a small set of test sentences
– Evaluate by computing the percentage of predicted edges that match a human annotator (see the sketch below)
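[Editor's sketch, not code from the talk.] As a concrete illustration of the evaluation step, the percentage of predicted edges matching the human annotation can be computed by comparing head indices; representing each parse as a vector of head positions is an assumption made here.

    #include <cstddef>
    #include <vector>

    // A dependency parse of an n-word sentence as a vector of head indices:
    // heads[i] is the position of word i's parent, with 0 reserved for the root.
    using Heads = std::vector<int>;

    // Directed attachment accuracy: the fraction of predicted edges that match
    // the annotated (gold) edges, pooled over all test sentences.
    double attachment_accuracy(const std::vector<Heads>& predicted,
                               const std::vector<Heads>& gold) {
      long correct = 0, total = 0;
      for (std::size_t s = 0; s < predicted.size(); ++s) {
        for (std::size_t i = 0; i < predicted[s].size(); ++i) {
          if (predicted[s][i] == gold[s][i]) ++correct;
          ++total;
        }
      }
      return total == 0 ? 0.0 : static_cast<double>(correct) / total;
    }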

Outline
– The Problem
– Distributed Architecture
– Experiments and Hadoop Issues

MapReduce for Grammar Induction
MapReduce was designed for:
– Large amounts of data distributed across many disks
– Simple data processing
We have:
– (Relatively) small amounts of data
– Expensive processing and high memory requirements

MapReduce for Grammar Induction
Algorithms require iterations for convergence:
– Each iteration requires a full sweep over all training data
– Computational bottleneck is computing expected counts for EM on each iteration (the gradient for L-BFGS)
Our approach: run one MapReduce job for each iteration
– Map: compute expected counts (gradient)
– Reduce: aggregate
– Offline: renormalize (EM) or modify parameter values (L-BFGS); a sketch of the renormalization follows below
Note: renormalization could be done in reduce tasks for EM with correct partition functions, but using L-BFGS in multiple reduce tasks is trickier
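[Editor's sketch; the data layout is an assumption, not the talk's actual representation.] The offline renormalization step for EM: aggregated expected counts are grouped by conditioning context, and each conditional distribution is rescaled to sum to one.

    #include <map>
    #include <string>

    // Aggregated expected counts grouped by conditioning context; for example,
    // counts["dep(NN, right)"]["CD"] holds the expected count behind
    // p_dep(CD | NN, right).
    using Table = std::map<std::string, std::map<std::string, double>>;

    // EM re-estimation: normalize each conditional distribution so that its
    // outcomes sum to one, yielding the new parameter values.
    Table renormalize(const Table& counts) {
      Table probs;
      for (const auto& ctx : counts) {
        double z = 0.0;
        for (const auto& outcome : ctx.second) z += outcome.second;
        if (z == 0.0) continue;                 // skip contexts with no mass
        for (const auto& outcome : ctx.second)
          probs[ctx.first][outcome.first] = outcome.second / z;
      }
      return probs;
    }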

MapReduce Implementation
[Diagram of one iteration:]
– Input sentences (POS tag sequences), e.g. [NNP,NNP,VBZ,NNP], [DT,JJ,NN,MD,VB,JJ,NNP,CD], [DT,NN,NN,VBZ,RB,VBN,VBN], …
– Map: compute expected counts, emitted as (parameter, count) pairs, e.g. p_root(NN) 0.345, p_root(NN) 1.875, p_dep(CD | NN, right) 0.175, p_dep(CD | NN, right) 0.025, p_dep(DT | NN, right) 0.065, …
– Reduce: aggregate expected counts, e.g. p_root(NN) 2.220, p_dep(CD | NN, right) 0.200, p_dep(DT | NN, right) 0.065, …
– Server: (1) normalize expected counts to get new parameter values; (2) start a new MapReduce job, placing the new parameter values (p_root(NN), p_dep(CD | NN, right), p_dep(DT | NN, right), …) on the distributed cache

Running Experiments
We use streaming for all experiments, with 2 C++ programs: server and map (reduce is a simple summer; a sketch follows below).

    > cd /home/kgimpel/grammar_induction
    > hod allocate -d /home/kgimpel/grammar_induction -n 25
    > ./dep_induction_server \
        input_file=/user/kgimpel/data/train20-20parts \
        aux_file=aux.train20 output_file=model.train20 \
        hod_config=/home/kgimpel/grammar_induction \
        num_reduce_tasks=5 1> stdout 2> stderr

– dep_induction_server runs a MapReduce job on each iteration
– Input is split into pieces for the map tasks (the dataset is too small for the default Hadoop splitter)
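[Editor's sketch; the actual reduce program is not shown on the slides.] A minimal "simple summer" for Hadoop streaming might look like the following: it reads tab-separated (parameter, expected count) lines from stdin, which streaming delivers grouped and sorted by key, and writes one summed line per parameter.

    #include <iostream>
    #include <sstream>
    #include <string>

    int main() {
      std::string line, key, current;
      double value = 0.0, sum = 0.0;
      bool have_key = false;
      while (std::getline(std::cin, line)) {
        std::istringstream in(line);
        if (!std::getline(in, key, '\t') || !(in >> value)) continue;  // skip malformed lines
        if (have_key && key != current) {            // key boundary: emit previous sum
          std::cout << current << '\t' << sum << '\n';
          sum = 0.0;
        }
        current = key;
        have_key = true;
        sum += value;
      }
      if (have_key) std::cout << current << '\t' << sum << '\n';  // flush the last key
      return 0;
    }

Because summation is associative, the same program could also serve as a combiner to cut down the data shuffled to the reduce tasks.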

Outline
– The Problem
– Distributed Architecture
– Experiments and Hadoop Issues

Speed-up with Hadoop
– 38,576 sentences, ≤ 40 words / sentence
– 40 nodes, 5 reduce tasks
– Average iteration time reduced from 2039 s to 115 s (roughly a 17x speed-up)
– Total time reduced from 3400 minutes to 200 minutes

Hadoop Issues
1. Overhead of running a single MapReduce job
2. Stragglers in the map phase

Typical Iteration (40 nodes, 38,576 sentences):

    23:17:05 : map 0% reduce 0%
    23:17:12 : map 3% reduce 0%
    23:17:13 : map 26% reduce 0%
    23:17:14 : map 49% reduce 0%
    23:17:15 : map 66% reduce 0%
    23:17:16 : map 72% reduce 0%
    23:17:17 : map 97% reduce 0%
    23:17:18 : map 100% reduce 0%
    23:18:00 : map 100% reduce 1%
    23:18:15 : map 100% reduce 2%
    23:18:18 : map 100% reduce 4%
    23:18:20 : map 100% reduce 15%
    23:18:27 : map 100% reduce 17%
    23:18:28 : map 100% reduce 18%
    23:18:30 : map 100% reduce 23%
    23:18:32 : map 100% reduce 100%

– Consistent 40-second delay between the map and reduce phases
– 115 s per iteration total, 40+ s of overhead: roughly a third of execution time is overhead!
– When we're running 100 iterations per experiment, 40 seconds per iteration really adds up!

Same typical iteration (log above):
– 5 reduce tasks used
– The reduce phase is simply aggregation of values for 2600 parameters
– Why does reduce take so long?

Histogram of Iteration Times
[Histogram not reproduced in the transcript: mean ≈ 115 s, plus a tail of much slower iterations]
What's going on here?

Typical Iteration (see the log above) vs. Slow Iteration:

    23:20:27 : map 0% reduce 0%
    23:20:34 : map 5% reduce 0%
    23:20:35 : map 20% reduce 0%
    23:20:36 : map 41% reduce 0%
    23:20:37 : map 56% reduce 0%
    23:20:38 : map 74% reduce 0%
    23:20:39 : map 95% reduce 0%
    23:20:40 : map 97% reduce 0%
    23:21:32 : map 97% reduce 1%
    23:21:37 : map 97% reduce 2%
    23:21:42 : map 97% reduce 12%
    23:21:43 : map 97% reduce 15%
    23:21:47 : map 97% reduce 19%
    23:21:50 : map 97% reduce 21%
    23:21:52 : map 97% reduce 26%
    23:21:57 : map 97% reduce 31%
    23:21:58 : map 97% reduce 32%
    23:23:46 : map 100% reduce 32%
    23:24:54 : map 100% reduce 46%
    23:24:55 : map 100% reduce 86%
    23:24:56 : map 100% reduce 100%

– 3 minutes waiting for the last map tasks to complete (map stuck at 97% from 23:20:40 to 23:23:46)

Suggestions? (Doesn't Hadoop replicate map tasks to avoid this?)

Questions?