Distributed Iterative Training
Kevin Gimpel, Shay Cohen, Severin Hacker, Noah A. Smith
Outline
– The Problem
– Distributed Architecture
– Experiments and Hadoop Issues
Iterative Training
Many problems in NLP and machine learning require iterating over large training sets many times:
– Training log-linear models (logistic regression, conditional random fields)
– Unsupervised or semi-supervised learning with EM (word alignment in MT, grammar induction)
– Minimum error-rate training in MT
– *Online learning (MIRA, perceptron, stochastic gradient descent)
All of the above except * are easily parallelized:
– Compute statistics on sections of the data independently
– Aggregate them
– Update parameters using the statistics of the full data set
– Repeat until a stopping criterion is met
(A sketch of this loop follows below.)
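To make the pattern concrete, here is a minimal sketch of that compute/aggregate/update loop in C++, using threads in place of cluster nodes. The data, model, and update rule are toy placeholders of our own, not the grammar-induction code from this talk.

// Sketch of the parallelized iterative-training pattern: the shards,
// "statistics", and update rule are placeholders, not the real model.
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const int kShards = 4, kParams = 8, kIterations = 10;
    std::vector<std::vector<int>> data(kShards, std::vector<int>(100, 1));  // toy shards
    std::vector<double> params(kParams, 1.0 / kParams);

    for (int iter = 0; iter < kIterations; ++iter) {
        // 1. Compute statistics on each section of the data independently.
        std::vector<std::vector<double>> partial(kShards, std::vector<double>(kParams, 0.0));
        std::vector<std::thread> workers;
        for (int s = 0; s < kShards; ++s) {
            workers.emplace_back([&, s] {
                for (int x : data[s]) partial[s][x % kParams] += params[x % kParams];
            });
        }
        for (auto& t : workers) t.join();

        // 2. Aggregate the partial statistics.
        std::vector<double> counts(kParams, 0.0);
        for (const auto& p : partial)
            for (int k = 0; k < kParams; ++k) counts[k] += p[k];

        // 3. Update parameters from statistics of the full data set
        //    (here: renormalize, as EM would).
        double total = 0.0;
        for (double c : counts) total += c;
        for (int k = 0; k < kParams; ++k) params[k] = counts[k] / total;
        // 4. Repeat until a stopping criterion is met (fixed count here).
    }
    std::printf("p[0] after training: %f\n", params[0]);
    return 0;
}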
Dependency Grammar Induction Given sentences of natural language text, infer (dependency) parse trees State-of-the-art results obtained using only a few thousand sentences of length ≤ 10 tokens (Smith and Eisner, 2006) This talk: scaling up to more and longer sentences using Hadoop!
Dependency Grammar Induction
Training:
– Input is a set of sentences (actually, POS tag sequences) and a grammar with initial parameter values
– Run an iterative optimization algorithm (EM, LBFGS, etc.) that changes the parameter values on each iteration
– Output is a learned set of parameter values
Testing:
– Use the grammar with the learned parameters to parse a small set of test sentences
– Evaluate by computing the percentage of predicted edges that match the human-annotated parses (see the sketch below)
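The evaluation metric is simple enough to sketch directly. Below is a hypothetical C++ version of that edge-matching accuracy, assuming parses are represented as per-token head indices; the representation and function name are our assumptions, not taken from the talk.

#include <cstdio>
#include <vector>

// Percentage of predicted dependency edges that match the human
// annotation. heads[i] is the index of token i's head (-1 = root).
// Toy representation; illustration only.
double AttachmentAccuracy(const std::vector<std::vector<int>>& gold,
                          const std::vector<std::vector<int>>& predicted) {
    int correct = 0, total = 0;
    for (size_t s = 0; s < gold.size(); ++s) {
        for (size_t i = 0; i < gold[s].size(); ++i) {
            total += 1;
            if (predicted[s][i] == gold[s][i]) correct += 1;
        }
    }
    return 100.0 * correct / total;
}

int main() {
    // One toy sentence of four tokens: gold heads vs. predicted heads.
    std::vector<std::vector<int>> gold = {{1, -1, 1, 2}};
    std::vector<std::vector<int>> pred = {{1, -1, 1, 1}};
    std::printf("accuracy: %.1f%%\n", AttachmentAccuracy(gold, pred));  // 75.0%
    return 0;
}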
Outline
– The Problem
– Distributed Architecture
– Experiments and Hadoop Issues
MapReduce for Grammar Induction
MapReduce was designed for:
– Large amounts of data distributed across many disks
– Simple data processing
We have:
– (Relatively) small amounts of data
– Expensive processing and high memory requirements
MapReduce for Grammar Induction
These algorithms require many iterations to converge, and each iteration requires a full sweep over all the training data. The computational bottleneck is computing expected counts for EM on each iteration (the gradient, for LBFGS).
Our approach: run one MapReduce job for each iteration
– Map: compute expected counts (gradient)
– Reduce: aggregate (a sketch of such a reducer follows below)
– Offline: renormalize (EM) or modify parameter values (LBFGS)
Note: renormalization could be done in the reduce tasks for EM, given the correct partition functions, but running LBFGS across multiple reduce tasks is trickier.
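For the aggregation step, a Hadoop Streaming reducer only has to sum values per key: streaming delivers tab-separated key/value lines on stdin, sorted by key, so equal keys arrive adjacent. A minimal C++ sketch of such a summer (an illustration of the streaming contract, not the actual program from these experiments):

// Streaming reducer: sums the values for each run of equal keys and
// emits one "key \t sum" line per key. Illustration only.
#include <cstdio>
#include <iostream>
#include <sstream>
#include <string>

int main() {
    std::string line, key, current;
    double value = 0.0, sum = 0.0;
    bool have_key = false;
    while (std::getline(std::cin, line)) {
        std::istringstream in(line);
        // Key is everything before the tab (keys here contain spaces,
        // e.g. "p_dep(CD | NN, right)"); the value follows the tab.
        if (!std::getline(in, key, '\t') || !(in >> value)) continue;
        if (have_key && key != current) {
            std::printf("%s\t%g\n", current.c_str(), sum);
            sum = 0.0;
        }
        current = key;
        have_key = true;
        sum += value;
    }
    if (have_key) std::printf("%s\t%g\n", current.c_str(), sum);
    return 0;
}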
MapReduce Implementation
[Diagram: data flow for one iteration]
– Input sentences (POS tag sequences) on disk, e.g.: [NNP,NNP,VBZ,NNP], [DT,JJ,NN,MD,VB,JJ,NNP,CD], [DT,NN,NN,VBZ,RB,VBN,VBN], …
– Map tasks compute expected counts, e.g.: p_root(NN) 0.345 and 1.875 (from two different tasks); p_dep(CD | NN, right) 0.175 and 0.025; p_dep(DT | NN, right) 0.065; …
– Reduce tasks aggregate the expected counts: p_root(NN) 2.220; p_dep(CD | NN, right) 0.200; p_dep(DT | NN, right) 0.065; …
– The server then: 1. normalizes the expected counts to get new parameter values (p_root(NN) = …, p_dep(CD | NN, right) = …, p_dep(DT | NN, right) = …); 2. starts a new MapReduce job, placing the new parameter values on the distributed cache
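The server's normalization step is an ordinary M-step: the aggregated expected counts of all parameters sharing a conditioning event are renormalized to sum to one. A toy sketch using two parameters from the diagram, which both condition on the event (NN, right); only two of the many outcomes are shown, so the resulting numbers are illustrative:

// M-step renormalization sketch. Counts are the aggregated expected
// counts from the diagram above; only two outcomes of the (NN, right)
// multinomial are included, so the probabilities are illustrative.
#include <cstdio>
#include <map>
#include <string>

int main() {
    std::map<std::string, double> counts = {
        {"p_dep(CD | NN, right)", 0.200},
        {"p_dep(DT | NN, right)", 0.065},
    };
    // Both parameters condition on the same event, so they share a total.
    double total = 0.0;
    for (const auto& kv : counts) total += kv.second;
    for (auto& kv : counts) {
        kv.second /= total;  // new parameter value = count / event total
        std::printf("%s = %.3f\n", kv.first.c_str(), kv.second);
    }
    return 0;
}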
Running Experiments
We use streaming for all experiments, with 2 C++ programs: server and map (the reduce is a simple summer)
> cd /home/kgimpel/grammar_induction
> hod allocate -d /home/kgimpel/grammar_induction -n 25
> ./dep_induction_server \
    input_file=/user/kgimpel/data/train20-20parts \
    aux_file=aux.train20 output_file=model.train20 \
    hod_config=/home/kgimpel/grammar_induction \
    num_reduce_tasks=5 1> stdout 2> stderr
dep_induction_server runs a MapReduce job on each iteration. The input is split into pieces for the map tasks, since the dataset is too small for the default Hadoop splitter (a sketch of such a splitter follows below).
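A hypothetical pre-splitter that distributes input lines round-robin across N balanced part files; the command-line interface and the ".partN" file naming are our inventions (cf. "train20-20parts" above):

// Splits <input> into <num_parts> files, one per map task, by dealing
// lines out round-robin. File naming is hypothetical.
#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <string>
#include <vector>

int main(int argc, char** argv) {
    if (argc != 3) {
        std::fprintf(stderr, "usage: %s <input> <num_parts>\n", argv[0]);
        return 1;
    }
    const int parts = std::atoi(argv[2]);
    if (parts <= 0) { std::fprintf(stderr, "num_parts must be > 0\n"); return 1; }
    std::ifstream in(argv[1]);
    std::vector<std::ofstream> out(parts);
    for (int i = 0; i < parts; ++i)
        out[i].open(std::string(argv[1]) + ".part" + std::to_string(i));
    std::string line;
    long n = 0;
    while (std::getline(in, line)) out[n++ % parts] << line << "\n";
    return 0;
}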
Outline
– The Problem
– Distributed Architecture
– Experiments and Hadoop Issues
Speed-up with Hadoop
38,576 sentences, ≤ 40 words/sentence, 40 nodes, 5 reduce tasks
– Average iteration time reduced from 2,039 s to 115 s (roughly a 17.7× speed-up: 2039 / 115 ≈ 17.7)
– Total time reduced from 3,400 minutes to 200 minutes (17×)
Hadoop Issues
1. Overhead of running a single MapReduce job
2. Stragglers in the map phase
Typical Iteration (40 nodes, 38,576 sentences):
23:17:05 : map 0% reduce 0%
23:17:12 : map 3% reduce 0%
23:17:13 : map 26% reduce 0%
23:17:14 : map 49% reduce 0%
23:17:15 : map 66% reduce 0%
23:17:16 : map 72% reduce 0%
23:17:17 : map 97% reduce 0%
23:17:18 : map 100% reduce 0%
23:18:00 : map 100% reduce 1%
23:18:15 : map 100% reduce 2%
23:18:18 : map 100% reduce 4%
23:18:20 : map 100% reduce 15%
23:18:27 : map 100% reduce 17%
23:18:28 : map 100% reduce 18%
23:18:30 : map 100% reduce 23%
23:18:32 : map 100% reduce 100%
– Consistent 40-second delay between the end of the map phase and the start of the reduce phase
– 115 s per iteration total, 40+ s of it overhead: roughly 35% of execution time is overhead!
– When we're running 100 iterations per experiment, 40 seconds per iteration really adds up
(Same typical-iteration log as above.) Only 5 reduce tasks are used, and the reduce phase is simply an aggregation of values for 2,600 parameters. Why does reduce take so long?
Histogram of Iteration Times
[Figure: histogram of per-iteration times; mean ≈ 115 s, with a long tail of much slower iterations. What's going on with those outliers?]
Typical Iteration (same log as above) vs. Slow Iteration:
23:20:27 : map 0% reduce 0%
23:20:34 : map 5% reduce 0%
23:20:35 : map 20% reduce 0%
23:20:36 : map 41% reduce 0%
23:20:37 : map 56% reduce 0%
23:20:38 : map 74% reduce 0%
23:20:39 : map 95% reduce 0%
23:20:40 : map 97% reduce 0%
23:21:32 : map 97% reduce 1%
23:21:37 : map 97% reduce 2%
23:21:42 : map 97% reduce 12%
23:21:43 : map 97% reduce 15%
23:21:47 : map 97% reduce 19%
23:21:50 : map 97% reduce 21%
23:21:52 : map 97% reduce 26%
23:21:57 : map 97% reduce 31%
23:21:58 : map 97% reduce 32%
23:23:46 : map 100% reduce 32%
23:24:54 : map 100% reduce 46%
23:24:55 : map 100% reduce 86%
23:24:56 : map 100% reduce 100%
3 minutes spent waiting for the last map tasks to complete (map sits at 97% from 23:20:40 to 23:23:46).
Suggestions? (Doesn't Hadoop replicate map tasks to avoid this?)
Questions?