Jichuan Chang Computer Sciences Department University of Wisconsin-Madison MW – A Framework to Support Master-Worker Style Applications

Outline › MW Overview › Current Status › Future Directions

MW = Master-Worker
› Master-Worker Style Parallel Applications
  • A large problem is partitioned into small pieces (tasks);
  • The master manages tasks and resources (the worker pool);
  • Each worker gets a task, executes it, sends the result back, and repeats until all tasks are done;
  • Examples: ray-tracing, optimization problems, etc.
› On Condor (PVM, Globus, …)
  • Many opportunities!
  • Issues (in a distributed opportunistic environment): resource management, communication, portability; fault-tolerance, dealing with runtime pool changes.

MW to Simplify the Work!
› An OO framework with simple interfaces
  • 3 classes to extend, a few virtual functions to fill in;
  • Scientists can focus on their algorithms.
› Lots of Functionality
  • Handles all the issues of a meta-computing environment;
  • Provides sufficient information to make smart decisions.
› Many Choices without Changing User Code
  • Multiple resource managers: Condor, PVM, …
  • Multiple communication interfaces: PVM, File, Socket, …

MW's Layered Architecture
[Diagram: the user's application classes extend MW's abstract classes through the API; MW in turn drives the underlying infrastructure (resource manager and communication layer) through the IPI, the Infrastructure Provider's Interface.]

MW’s Runtime Structure
1. User code adds tasks to the master’s ToDo list;
2. Each task is sent to a worker (ToDo -> Running);
3. The task is executed by the worker;
4. The result is sent back to the master;
5. User code processes the result (and can add/remove tasks).
[Diagram: a master process holding ToDo and Running task lists, connected to a pool of worker processes.]

MW Programming
• class Your_Driver: for your master behavior (setup and main loop)
    get_userinfo()
    setup_initial_tasks()
    act_on_completed_task()
• class Your_Worker: for your worker behavior
    unpack_init_data()
    benchmark(MWTask *t)
    execute_task(MWTask *t)
• class Your_Task: to store and parse task info (pack/unpack)
    pack_work() / unpack_work()
    pack_results() / unpack_results()

More MW Features
› Checkpointing/restarting
› IPI and multiple Resource Manager and Communication (RMComm) ports

    RMComm      Resource Mgr   Communication
    MW-PVM      Condor-PVM     PVM
    MW-File     Condor         Files
    MW-Socket   Condor         Socket
    MW-Indp     Single Host    memcpy()

  More RMComm ports?
    MW-Java     Condor         Files
    MW-MPI      Condor-MPI     MPI

MW Summary
› It’s simple: simple API, minimal user code.
› It’s powerful: works on meta-computing platforms.
› It’s inexpensive: on top of Condor, it can exploit hundreds of machines.
› It solves hard problems! Nug30, STORM, …

MW Success Stories
› Nug30 solved in 7 days by MW-QAP
  • A quadratic assignment problem outstanding for 30 years
  • Utilized 2500 machines from 10 sites (NCSA, ANL, UWisc, GaTech, …)
  • 1009 workers at peak, 11 CPU-years
› STORM (flight scheduling)
  • Stochastic programming problem (1000M rows x 13000M cols)
  • 2000 times larger than the best sequential program can handle
  • 556 workers at peak, 1 CPU-year

MW Users/Collaborators

    Institute                   For What                                   Project Name
    ANL & UWisc                 Optimization                               FATCOP and ATR
    UCSD                        Comp. Architecture Research and others
    JPL                         Image Processing
    UIUC                        Optimization Algebra; Comp. Arch.
    Research Inst. at Pakistan  Genetic Algorithm Middleware Scheduling
    UWisc                       Grid Middleware Scheduling                 POEMS
    Hungary                     Performance Visualization                  P-GRADE
    Sandia NL                   Optimization and MPI

We expect more to come!

Status Update (since 07/2001)
› Better config/build system, new app. skeleton
› MW-Indp back to work, “insured” the code
› Performance measurement and debugging
› Support for millions of tasks via indexing & swapping
› Robustness enhancements
  • Better handling of host suspension/resume
  • Better handling of task reassignments
› Bug fixes (download from the website)
› Mailing list –

Challenges and Future Work (1)
› Scalability
  • The master is the bottleneck: it keeps only 30% of workers busy
  • Improved worker utilization is shown below
  • But what about the workers?
[Chart: worker utilization over time (hr)]

Challenges and Future Work (2)
› Enhancing Scalability
  • Worker hierarchy to remove the bottleneck
  • Runtime adaptive throttling of workers
  • Group tasks to schedule at a larger granularity
  • Needs more involvement from application designers
› Understanding Performance and Scheduling
  • Collect data and predict performance
  • Collect information at runtime
  • Several groups are studying scheduling for grid middleware (UAB & POEMS)

Challenges and Future Work (3)
› Improving Usability
  • More debugging support
  • Redesign the current MW API
  • Support more communication interfaces
  • Create a test suite (and better docs/examples)
  • Improve logging/error handling
› Solve more and harder computational problems!

Thank You!
› Further Information:
  • Homepage:
  • Papers:
› BOF session: Wednesday morning at 3369, come talk to Jichuan Chang.

MW Backup Slides

Fatcop Recent Run

MW API
› Must extend three classes
  • MWDriver: to define your master behavior;
  • MWWorker: to define your worker behavior;
  • MWTask: to store/parse task information.
› Might use other MW utilities
  • MWprintf: to print progress, results, debug info, etc;
  • MWDriver: to get information, set control policies, etc;
  • RMC (Resource Manager & Communicator): to specify resource requirements, prepare for communication, etc.

MW Programming (1)
› class Your_Driver: public MWDriver
  • Setup
      get_userinfo(): to parse args and do the initial setup;
      setup_initial_tasks(): to create initial tasks;
  • Main loop (event driven)
      act_on_completed_task(): lets user code process the result;
  • Optional
      set_task_key_func(), set_***_policy(), set_***_mode();
      add_task() / delete_tasks_worse_than();
      write_master_state() / read_master_state();
      pack_worker_init_data() / unpack_worker_initinfo()

MW Programming (2)
› class Your_Worker: public MWWorker
  • Setup
      unpack_init_data()
      benchmark(MWTask *t)
  • Main loop (event driven)
      execute_task(MWTask *t)
› class Your_Task: public MWTask
  • Pack/Unpack
      pack_work() / unpack_work()
      pack_results() / unpack_results()
  • Checkpoint/restore
      write_ckpt_info() / read_ckpt_info()

MW Submit File
› Universe
  • PVM (for MW-CondorPVM)
  • Scheduler (for MW-File and MW-Socket)
› Executable: the master executable
› Input (or Arguments)
  • worker executable name(s);
  • configuration, input data.
› Output: the master’s stdout
› Error: the workers’ stdout (and stderr)
› Requirements: additional requirements
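Putting the items above together, a submit file for an MW-File style application might look like the sketch below. All file names and the requirements expression are hypothetical placeholders, not from the original slides:

```
universe     = scheduler
executable   = mw_master
arguments    = mw_worker config.in data.in
output       = master.out
error        = workers.err
requirements = (OpSys == "LINUX") && (Arch == "INTEL")
queue
```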

MW Contributors
› Jeff Linderoth
› Jean-Pierre Goux
› Mike Yoder
› Sanjeev Kulkarni
› Peter Keller
› Jichuan Chang
› Elisa Heymann
› …