Fault Tolerant MapReduce-MPI for HPC Clusters
Yanfei Guo*, Wesley Bland+, Pavan Balaji+, Xiaobo Zhou*
* Dept. of Computer Science, University of Colorado, Colorado Springs
+ Mathematics and Computer Science Division, Argonne National Laboratory

Outline
 Overview
 Background
 Challenges
 FT-MRMPI
 – Design
 – Checkpoint-Restart
 – Detect-Resume
 Evaluation
 Conclusion

MapReduce on HPC Clusters
 What MapReduce provides
 – Write serial code and run it in parallel
 – Reliable execution with the detect-restart fault tolerance model
 HPC clusters
 – High-performance CPU, storage, and network
 MapReduce on HPC clusters
 – High-performance big data analytics
 – Reduced data movement between systems

(Figure: MapReduce on the HPC software stack, with fault tolerance)

                 Mira                            Hadoop
CPU              16 1.6 GHz PPC A2 cores         … GHz Intel Xeon cores
Memory           16 GB (1 GB/core)               … GB (4-8 GB/core)
Storage          Local: N/A; Shared: 24 PB SAN   Local: 500 GB x 8; Shared: N/A
Network          5D Torus                        10/40 Gbps Ethernet
Software Env     MPI, …                          Java, …
File System      GPFS                            HDFS
Scheduler        Cobalt                          Hadoop
MapReduce Lib    MapReduce-MPI                   Hadoop

Fault Tolerance Model of MapReduce
 Master/Worker model
 Detect: the master monitors all workers
 Restart: affected tasks are rescheduled to another worker

(Figure: master with scheduler and job dispatching MapTask/ReduceTask to map and reduce slots on workers)
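The detect-restart cycle can be sketched as a heartbeat-checking master that returns the tasks of a silent worker to a pending queue. This is a minimal illustration with assumed names (Master, Task, check_failures) and an assumed timeout policy, not Hadoop's actual scheduler code:

    #include <chrono>
    #include <deque>
    #include <map>
    #include <vector>

    using Clock = std::chrono::steady_clock;

    struct Task { int id; };

    class Master {
    public:
        // Workers report in periodically; the master records the time.
        void heartbeat(int worker) { last_seen_[worker] = Clock::now(); }

        // Record that a task is running on a worker.
        void assign(int worker, Task t) { running_[worker].push_back(t); }

        // Detect: a worker silent longer than the timeout is declared failed.
        // Restart: its in-flight tasks go back to the pending queue for rescheduling.
        void check_failures(std::chrono::seconds timeout) {
            const auto now = Clock::now();
            for (auto it = last_seen_.begin(); it != last_seen_.end();) {
                if (now - it->second > timeout) {
                    for (const Task& t : running_[it->first]) pending_.push_back(t);
                    running_.erase(it->first);
                    it = last_seen_.erase(it);
                } else {
                    ++it;
                }
            }
        }

    private:
        std::map<int, Clock::time_point> last_seen_;
        std::map<int, std::vector<Task>> running_;
        std::deque<Task> pending_;   // tasks waiting to be (re)assigned
    };

    int main() {
        Master m;
        m.assign(7, Task{42});
        m.heartbeat(7);
        m.check_failures(std::chrono::seconds(30));   // worker 7 just reported, so no failure
    }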

No Fault Tolerance in MPI
 MPI: Message Passing Interface
 – Inter-process communication
 – Communicator (COMM)
 Frequent failures at large scale
 – MTTF = 4.2 hr (NCSA Blue Waters)
 – MTTF < 1 hr in the future
 MPI Standard 3.1
 – Custom error handler
 – No guarantee that all processes go into the error handler
 – No fix for a broken COMM
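For reference, the custom error handler mentioned above is installed with standard MPI calls; the sketch below shows only the mechanics, with an illustrative handler body. Even with such a handler, the standard gives no guarantee that every rank enters it, and the broken communicator cannot be repaired here.

    // Sketch: installing a custom error handler on MPI_COMM_WORLD so that a
    // communication failure calls back into the application instead of aborting.
    #include <mpi.h>
    #include <cstdio>

    static void comm_error_handler(MPI_Comm* comm, int* errcode, ...) {
        char msg[MPI_MAX_ERROR_STRING];
        int len = 0;
        MPI_Error_string(*errcode, msg, &len);
        std::fprintf(stderr, "MPI error reported: %s\n", msg);
        // Application-specific cleanup would go here (e.g., flush local state).
    }

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        MPI_Errhandler errh;
        MPI_Comm_create_errhandler(comm_error_handler, &errh);
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, errh);  // replaces MPI_ERRORS_ARE_FATAL

        // ... normal communication on MPI_COMM_WORLD ...

        MPI_Errhandler_free(&errh);
        MPI_Finalize();
        return 0;
    }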

Scheduling Restrictions
 Gang scheduling
 – Schedules all processes at the same time
 – Preferred by HPC applications with extensive synchronization
 MapReduce scheduling
 – Per-task scheduling
 – Schedules each task as early as possible
 – Compatible with the detect-restart fault tolerance model
 Resizing a running job
 – Many platforms do not support it
 – Large overhead (re-queueing)

The detect-restart fault tolerance model is not compatible with HPC schedulers.

Overall Design
 Fault-tolerant MapReduce using MPI
 – Reliable failure detection and propagation
 – Compatible fault tolerance model
 FT-MRMPI
 – Task Runner
 – Distributed Master & Load Balancer
 – Failure Handler
 Features
 – Traceable job interfaces
 – HPC-scheduler-compatible fault tolerance models: Checkpoint-Restart, Detect-Resume

(Figure: a MapReduce job composed of MapReduce processes on top of MPI, each containing a Task Runner, Distributed Master, Load Balancer, and Failure Handler)

Task Runner
 Tracing, establishing consistent states
 – Delegating operations to the library
 New interface
 – Highly extensible
 – Embedded tracing
 – Record-level consistency

(Figure: the user program calls MR-MPI Map(), which reads a record, calls (*func)() to process it, and writes the resulting key-value pairs)

New interface and word-count example from the slide (template parameters were lost in extraction and are reconstructed here as generic key/value types):

    template <class K, class V> class RecordReader;
    template <class K, class V> class RecordWriter;
    class WordRecordReader : public RecordReader<int, std::string> { /* ... */ };
    template <class K, class V> class Mapper;
    template <class K, class V> class Reducer;

    void Mapper::map(int& key, string& value, BaseRecordWriter* out, void* param)
    {
        out->add(value, 1);
    }

    int main(int narg, char** args)
    {
        MPI_Init(&narg, &args);
        mr->map(new WCMapper(), new WCReader(), NULL, NULL);
        mr->collate(NULL);
        mr->reduce(new WCReducer(), NULL, new WCWriter(), NULL);
    }

Distributed Master & Load Balancer
 Task dispatching
 – Global task pool
 – Job init
 – Recovery
 Global consistent state
 – Shuffle buffer tracing
 Load balancing
 – Monitoring the processing speed of tasks
 – Linear job performance model

(Figure: MapReduce processes pulling tasks from a shared task pool; each process contains a Task Runner, Distributed Master, Load Balancer, and Failure Handler)
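One way to realize a global task pool over MPI is an atomic counter in an RMA window that ranks increment to claim the next task. The sketch below illustrates that idea under assumed task counts and names; it is not the FT-MRMPI distributed master itself.

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const long num_tasks = 1000;      // assumed total number of map tasks
        long* counter = nullptr;
        MPI_Win win;

        // Rank 0 hosts the shared task counter; other ranks expose no memory.
        MPI_Win_allocate(rank == 0 ? sizeof(long) : 0, sizeof(long),
                         MPI_INFO_NULL, MPI_COMM_WORLD, &counter, &win);
        if (rank == 0) *counter = 0;
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Win_lock_all(0, win);

        const long one = 1;
        long task;
        while (true) {
            // Atomically claim the next task index from the pool.
            MPI_Fetch_and_op(&one, &task, MPI_LONG, /*target=*/0,
                             /*disp=*/0, MPI_SUM, win);
            MPI_Win_flush(0, win);
            if (task >= num_tasks) break;
            std::printf("rank %d runs task %ld\n", rank, task);   // run_task(task)
        }

        MPI_Win_unlock_all(win);
        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }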

Fault Tolerance Model: Checkpoint-Restart
 Custom error handler
 – Save state and exit gracefully
 – Propagate the failure event with MPI_Abort()
 Checkpoint
 – Asynchronous within a phase
 – Saved locally
 – Multiple granularities
 Restart to recover
 – Resubmit with -recover
 – Picks up from where the job left off

(Figure: during map/shuffle/reduce, the failed process enters the error handler and calls MPI_Abort(); the other processes save their states)
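A minimal sketch of the save-and-propagate path in this model, assuming a per-process progress marker and illustrative file names; the real checkpoints cover task state and shuffle buffers at several granularities:

    #include <mpi.h>
    #include <fstream>
    #include <string>

    static long g_last_record = 0;   // index of the last fully processed record

    static void save_and_exit(MPI_Comm* comm, int* errcode, ...) {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        // Save the progress marker so a restarted run can skip completed records.
        std::ofstream ckpt("/tmp/ftmr-ckpt." + std::to_string(rank));
        ckpt << g_last_record << "\n";
        ckpt.close();

        // Propagate the failure: the surviving processes are taken down so the
        // job terminates cleanly and can be resubmitted with a recover flag.
        MPI_Abort(MPI_COMM_WORLD, *errcode);
    }

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        MPI_Errhandler errh;
        MPI_Comm_create_errhandler(save_and_exit, &errh);
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, errh);

        // ... map/shuffle/reduce loop updates g_last_record as records complete ...

        MPI_Finalize();
        return 0;
    }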

Where to Write Checkpoint
 Write to GPFS
 – Performance issues due to small I/O
 – Interference on shared hardware
 Write to node-local disk
 – Fast, no interference
 – Global availability in recovery?
 Background data copier
 – Write locally
 – Sync to GPFS in the background
 – Overlaps I/O with computation

(Plot: wordcount, 100 GB input, 256 procs, ppn=8)
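The background data copier idea can be sketched as a queue drained by a helper thread that copies finished local checkpoint files to the shared file system, so the copy overlaps with computation. The class below is a hypothetical illustration (paths, names, and the copy policy are assumptions), not the FT-MRMPI code:

    #include <condition_variable>
    #include <filesystem>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>

    class BackgroundCopier {
    public:
        explicit BackgroundCopier(std::string shared_dir)
            : shared_dir_(std::move(shared_dir)), worker_([this] { run(); }) {}

        ~BackgroundCopier() {
            { std::lock_guard<std::mutex> lk(m_); done_ = true; }
            cv_.notify_one();
            worker_.join();
        }

        // Called after a checkpoint file has been fully written to local disk.
        void enqueue(const std::string& local_path) {
            { std::lock_guard<std::mutex> lk(m_); pending_.push(local_path); }
            cv_.notify_one();
        }

    private:
        void run() {
            std::unique_lock<std::mutex> lk(m_);
            while (!done_ || !pending_.empty()) {
                cv_.wait(lk, [this] { return done_ || !pending_.empty(); });
                while (!pending_.empty()) {
                    std::string src = pending_.front();
                    pending_.pop();
                    lk.unlock();
                    // Copy to the shared file system (e.g., GPFS) outside the lock.
                    std::filesystem::copy_file(
                        src,
                        shared_dir_ + "/" + std::filesystem::path(src).filename().string(),
                        std::filesystem::copy_options::overwrite_existing);
                    lk.lock();
                }
            }
        }

        std::string shared_dir_;
        std::mutex m_;
        std::condition_variable cv_;
        std::queue<std::string> pending_;
        bool done_ = false;
        std::thread worker_;   // declared last so the other members exist when it starts
    };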

Recover Point
 Recover to the last file (ft-file)
 – Less frequent checkpoints
 – Needs reprocessing on recovery; some work is lost
 Recover to the last record (ft-rec)
 – Requires fine-grained checkpoints
 – Skips records rather than reprocessing them

(Plots: wordcount and pagerank)
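Record-level recovery (ft-rec) amounts to reading the saved record index and skipping already-processed records instead of redoing them. A minimal sketch with illustrative names:

    #include <fstream>
    #include <string>

    long load_recover_point(const std::string& ckpt_path) {
        std::ifstream in(ckpt_path);
        long last = -1;
        in >> last;                 // -1 means no checkpoint: start from scratch
        return last;
    }

    void process_file(const std::string& input, const std::string& ckpt_path) {
        const long recover_point = load_recover_point(ckpt_path);
        std::ifstream in(input);
        std::string line;
        for (long rec = 0; std::getline(in, line); ++rec) {
            if (rec <= recover_point) continue;   // skip records already done
            // map(rec, line); periodically persist rec to ckpt_path
        }
    }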

Drawbacks of Checkpoint-Restart
 Checkpoint-Restart works, but it is not perfect
 – Large overhead due to reading and writing checkpoints
 – Requires human intervention (resubmission)
 – A failure can also occur during recovery

Fault Tolerance Model: Detect-Resume
 Detect
 – Global knowledge of the failure
 – Identify failed processes by comparing groups
 Resume
 – Fix the COMM by excluding failed processes
 – Balanced distribution of affected tasks
 – Work-conserving vs. non-work-conserving
 User-Level Failure Mitigation (ULFM)
 – MPIX_Comm_revoke()
 – MPIX_Comm_shrink()

(Figure: during map/shuffle/reduce, the surviving processes enter the error handler, revoke the communicator, and shrink it after a process fails)
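With ULFM, the detect and resume steps map onto MPIX_Comm_revoke() and MPIX_Comm_shrink(), and the failed ranks fall out of a group comparison between the old and shrunk communicators. The handler below is a sketch that assumes an MPI library with ULFM support; the task-redistribution call is hypothetical:

    #include <mpi.h>
    #include <mpi-ext.h>   // MPIX_Comm_revoke / MPIX_Comm_shrink (ULFM)
    #include <cstdio>

    static MPI_Comm work_comm = MPI_COMM_NULL;

    static void on_failure(MPI_Comm* comm, int* errcode, ...) {
        // Detect: make the failure globally known by revoking the communicator.
        MPIX_Comm_revoke(*comm);

        // Resume: build a new communicator containing only the survivors.
        MPI_Comm survivors;
        MPIX_Comm_shrink(*comm, &survivors);

        // Identify failed ranks by comparing the old and the shrunk groups.
        MPI_Group old_grp, new_grp, failed_grp;
        MPI_Comm_group(*comm, &old_grp);
        MPI_Comm_group(survivors, &new_grp);
        MPI_Group_difference(old_grp, new_grp, &failed_grp);

        int nfailed;
        MPI_Group_size(failed_grp, &nfailed);
        std::printf("%d process(es) failed; redistributing their tasks\n", nfailed);
        // redistribute_tasks(failed_grp);   // balanced re-assignment to survivors

        MPI_Group_free(&old_grp);
        MPI_Group_free(&new_grp);
        MPI_Group_free(&failed_grp);
        work_comm = survivors;               // continue the job on the new COMM
    }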

Evaluation Setup
 LCRC Fusion Cluster [1]
 – 256 nodes
 – CPU: 2-way 8-core Intel Xeon X5550
 – Memory: 36 GB
 – Local disk: 250 GB
 – Network: Mellanox InfiniBand QDR
 Benchmarks
 – Wordcount, BFS, Pagerank
 – mrmpiBLAST [1]

Job Performance
 10%-13% checkpointing overhead
 Up to 39% shorter completion time when failures occur

Checkpoint Overhead
 Factors
 – Granularity: number of records per checkpoint
 – Size of records

Time Decomposition
 Performance with failure and recovery
 – Wordcount, all processes together
 – Detect-Resume has less data to recover

Continuous Failures
 Pagerank
 – 256 processes; 1 process randomly killed every 5 seconds

Conclusion
 First fault-tolerant MapReduce implementation in MPI
 – Redesigned MR-MPI to provide fault tolerance
 – Highly extensible while providing the essential features for fault tolerance
 Two fault tolerance models
 – Checkpoint-Restart
 – Detect-Resume

Thank you! Q & A

Backup Slides

Prefetching Data Copier
 Recover from GPFS
 – Reading everything from GPFS
 – Processes wait for I/O
 Prefetching in recovery
 – Move data from GPFS to local disk
 – Overlaps I/O with computation

2-Pass KV-KMV Conversion
 4-pass conversion in MR-MPI
 – Excessive disk I/O during shuffle
 – Hard to make checkpoints
 2-pass KV-KMV conversion
 – Log-structured file system
 – KV -> Sketch, Sketch -> KMV

Recover Time
 Recovery from local disk, GPFS, and GPFS with prefetching