MW: A framework to support Master Worker Applications Sanjeev R. Kulkarni Computer Sciences Department University of Wisconsin-Madison

Slides:



Advertisements
Similar presentations
Current methods for negotiating firewalls for the Condor ® system Bruce Beckles (University of Cambridge Computing Service) Se-Chang Son (University of.
Advertisements

Threads, SMP, and Microkernels
MapReduce Online Created by: Rajesh Gadipuuri Modified by: Ying Lu.
Operating Systems Lecture 10 Issues in Paging and Virtual Memory Adapted from Operating Systems Lecture Notes, Copyright 1997 Martin C. Rinard. Zhiqing.
CERN LCG Overview & Scaling challenges David Smith For LCG Deployment Group CERN HEPiX 2003, Vancouver.
Jaime Frey Computer Sciences Department University of Wisconsin-Madison Condor-G: A Case in Distributed.
Presenter : Shih-Tung Huang Tsung-Cheng Lin Kuan-Fu Kuo 2015/6/15 EICE team Model-Level Debugging of Embedded Real-Time Systems Wolfgang Haberl, Markus.
Figure 1.1 Interaction between applications and the operating system.
CS533 - Concepts of Operating Systems
L22: SC Report, Map Reduce November 23, Map Reduce What is MapReduce? Example computing environment How it works Fault Tolerance Debugging Performance.
Condor Project Computer Sciences Department University of Wisconsin-Madison Asynchronous Notification in Condor By Vidhya Murali.
MULTICOMPUTER 1. MULTICOMPUTER, YANG DIPELAJARI Multiprocessors vs multicomputers Interconnection topologies Switching schemes Communication with messages.
Next Generation of Apache Hadoop MapReduce Arun C. Murthy - Hortonworks Founder and Architect Formerly Architect, MapReduce.
Jim Basney Computer Sciences Department University of Wisconsin-Madison Managing Network Resources in.
Jaeyoung Yoon Computer Sciences Department University of Wisconsin-Madison Virtual Machines in Condor.
Derek Wright Computer Sciences Department, UW-Madison Lawrence Berkeley National Labs (LBNL)
Introduction to Symmetric Multiprocessors Süha TUNA Bilişim Enstitüsü UHeM Yaz Çalıştayı
Cheap cycles from the desktop to the dedicated cluster: combining opportunistic and dedicated scheduling with Condor Derek Wright Computer Sciences Department.
Advanced Topics: MapReduce ECE 454 Computer Systems Programming Topics: Reductions Implemented in Distributed Frameworks Distributed Key-Value Stores Hadoop.
Pregel: A System for Large-Scale Graph Processing
Distributed Systems Early Examples. Projects NOW – a Network Of Workstations University of California, Berkely Terminated about 1997 after demonstrating.
Computer System Architectures Computer System Software
CSC 456 Operating Systems Seminar Presentation (11/13/2012) Leon Weingard, Liang Xin The Google File System.
Chapter 4 Threads, SMP, and Microkernels Dave Bremer Otago Polytechnic, N.Z. ©2008, Prentice Hall Operating Systems: Internals and Design Principles, 6/E.
MWDriver: An Object-Oriented Library for Master-Worker Applications Mike Yoder, Jeff Linderoth, Jean-Pierre Goux February 26, 1999.
Parallel Programming Models Jihad El-Sana These slides are based on the book: Introduction to Parallel Computing, Blaise Barney, Lawrence Livermore National.
MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat.
Grid Workload Management & Condor Massimo Sgaravatto INFN Padova.
Parallel Optimization Tools for High Performance Design of Integrated Circuits WISCAD VLSI Design Automation Lab Azadeh Davoodi.
Processes and Threads Processes have two characteristics: – Resource ownership - process includes a virtual address space to hold the process image – Scheduling/execution.
Miron Livny Computer Sciences Department University of Wisconsin-Madison Welcome and Condor Project Overview.
Condor Project Computer Sciences Department University of Wisconsin-Madison A Scientist’s Introduction.
Advanced Computer Networks Topic 2: Characterization of Distributed Systems.
Condor: High-throughput Computing From Clusters to Grid Computing P. Kacsuk – M. Livny MTA SYTAKI – Univ. of Wisconsin-Madison
Chapter 8-2 : Multicomputers Multiprocessors vs multicomputers Multiprocessors vs multicomputers Interconnection topologies Interconnection topologies.
Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison Condor RoadMap.
Derek Wright Computer Sciences Department University of Wisconsin-Madison MPI Scheduling in Condor: An.
Silberschatz and Galvin  Operating System Concepts Module 1: Introduction What is an operating system? Simple Batch Systems Multiprogramming.
Condor Project Computer Sciences Department University of Wisconsin-Madison Master/Worker and Condor.
Derek Wright Computer Sciences Department University of Wisconsin-Madison Condor and MPI Paradyn/Condor.
Silberschatz, Galvin and Gagne ©2009 Operating System Concepts – 8 th Edition, Chapter 3: Process-Concept.
FATCOP: A Mixed Integer Program Solver Michael FerrisQun Chen Department of Computer Sciences University of Wisconsin-Madison Jeff Linderoth, Argonne.
Jichuan Chang Computer Sciences Department University of Wisconsin-Madison MW – A Framework to Support.
MWDriver: An Object-Oriented Library for Master-Worker Applications Mike Yoder, Jeff Linderoth, Jean-Pierre Goux June 3rd, 1999.
Miron Livny Computer Sciences Department University of Wisconsin-Madison Condor and (the) Grid (one of.
Silberschatz and Galvin  Operating System Concepts Module 1: Introduction What is an operating system? Simple Batch Systems Multiprogramming.
1.1 Sandeep TayalCSE Department MAIT 1: Introduction What is an operating system? Simple Batch Systems Multiprogramming Batched Systems Time-Sharing Systems.
By Nitin Bahadur Gokul Nadathur Department of Computer Sciences University of Wisconsin-Madison Spring 2000.
Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison Condor NT Condor ported.
Jaime Frey Computer Sciences Department University of Wisconsin-Madison Condor and Virtual Machines.
Background Computer System Architectures Computer System Software.
Next Generation of Apache Hadoop MapReduce Owen
COMP7330/7336 Advanced Parallel and Distributed Computing MapReduce - Introduction Dr. Xiao Qin Auburn University
Condor on Dedicated Clusters Peter Couvares and Derek Wright Computer Sciences Department University of Wisconsin-Madison
Lecture 3 – MapReduce: Implementation CSE 490h – Introduction to Distributed Computing, Spring 2009 Except as otherwise noted, the content of this presentation.
Applied Operating System Concepts
TensorFlow– A system for large-scale machine learning
Condor – A Hunter of Idle Workstation
PREGEL Data Management in the Cloud
Introduction to Operating Systems
湖南大学-信息科学与工程学院-计算机与科学系
Master-Worker Tutorial Condor Week 2006
Introduction to locality sensitive approach to distributed systems
Operating Systems.
Operating System Concepts
Interpret the execution mode of SQL query in F1 Query paper
Chapter 2: Operating-System Structures
Chapter 2: Operating-System Structures
Operating System Concepts
MapReduce: Simplified Data Processing on Large Clusters
Presentation transcript:

MW: A framework to support Master Worker Applications Sanjeev R. Kulkarni Computer Sciences Department University of Wisconsin-Madison

2 Outline of the talk › Introduction to MW › Architecture of MW › How to use MW? › Records shattered by MW! › Extensions

3 What is MW? › MW = Master Worker › Object Oriented, Fault Tolerant framework for Master-Worker Applications › Can run on a variety of resource managers like Condor, PVM, Globus,...

4 MW › Object Oriented  MW is a set of C++ base classes  Users write a few virtual functions › Fault Tolerant  Handles workers joining/leaving  Handles suspended/resumed workers  Checkpointing

5 MW Framework Application MWLayer Resource Management and Communication Layer E.g. Condor, PVM, Globus,...

6 Why use MW? › Handles communication layer › Handles resource Management  Useful especially in an opportunistic environment like Condor › Same application code can run on various resource managers

7 MW Layer › Three main entities  MWDriver corresponds to the (Tyrant) Master  MWWorker corresponds to the (exploited) Worker  MWTask The work itself!

8 MWDriver: The Master › Setup › Manages workers joining/leaving › Deals with HostSuspend/HostDelete › Maintains lists of tasks  ToDo, Done, Running › Acts on completed tasks › Checkpointing

9 MWWorker: The Worker › Executes the given task  Unpacks the task got from Master  Executes the task  Packs results and sends back to Master

10 MWTask: The Work › Definition of a Unit of work › Holds work to be done and results › Follows the path Master-Worker- Master

11 Master Workers Running To Do T1 T2 T3... Global Data Done

12 Master Workers Running To Do T2 T3 T4... Global Data Wid 1 W1 T1 Done

13 Master Workers Running To Do T5 T6 T7... Global Data Wid 1 W1 T1 T2 T3 T4 Wid 2Wid 3 Wid 4 W2W3 W4 Done

14 Master Workers Running To Do T6 T7 T8... Global Data Wid 1 W1 T1 T2 T5 T4 Wid 2Wid 3 Wid 4 W2W3 W4 T3 Done

15 Master Workers Running To Do T2 T6 T7... Global Data Wid 1 W1 T1 T5 T4 Wid 2Wid 3 Wid 4 W2W3 W4 T3 Done

16 Intelligent Scheduling of Tasks › Some tasks may be harder › Some machines may be faster › Task ordering based on user defined MWKey › Machine ordering based on Resource Manager information  e.g. Condor ClassAds

17 Checkpointing › Save master state in case of failure › Automatic restart from checkpoint file in case of master failure › User controlled checkpoint frequency › Users need to implement two additional functions to take advantage of these.

18 How can you use MW? › MWDriver’s virtual functions  get_user_info () - Processes arguments and does basic setup  setup_initial_tasks () - Fills in the ToDo list  pack_worker_init_data () - packs init data for worker  act_on_completed_tasks () - called every time a task completes. Optional.

19 Using MW (cont.) › MWWorker  unpack_init_data () - unpack the init data sent by the master  execute_task () - execute a task and fill in the results › MWTask  pack_work (), unpack_work ()  pack_results (), unpack_results ()

20 Resource Management and Communication Layer › Presently two implementations exist  MW-PVM Communication :PVM Resource Manager : Condor-PVM  MW-File Communication : Files Resource Manager : Condor

21 MW-Independent › Single process version  master  single worker › sends and receives become memcpy › No changes in source! › Why?  Faster debugging

22 Data Management › Exploitation of available memory as compared to CPU cycles › Aims to reduce access to storage site by caching data on network › Aimed at applications involving extremely large data sets

23 Data Manager › Partitions large data sets into chunks › Workers work on chunks allocated by manager Data Manager W1W2W3 Storage DeviceMemory

24 World Record Spree!! › Stochastic Linear Programming  “Record breaking” problem of over 8.5M rows and 35M columns solved › Quadratic Assignment Problem  nug27 in a little more than two days! › Jeff will talk more in his talk

25 Extensions › More Resource Manager ports  Globus  A thread based version for SMPs 

26 Scalability Issues › Scales linearly with respect to the number of workers › OK for 200 workers but for 1000? › Solution  Hierarchical MW?

27 More about MW › › Demos of MW in action using Condor pool tomorrow in Room 3397