Published by Juliana Walsh. Modified over 9 years ago.
1
MW: A framework to support Master Worker Applications
Sanjeev R. Kulkarni
Computer Sciences Department
University of Wisconsin-Madison
sanjeevk@cs.wisc.edu
2
www.cs.wisc.edu/condor
Outline of the talk
› Introduction to MW
› Architecture of MW
› How to use MW?
› Records shattered by MW!
› Extensions
3
What is MW?
› MW = Master Worker
› Object-oriented, fault-tolerant framework for master-worker applications
› Can run on a variety of resource managers such as Condor, PVM, Globus, ...
4
MW
› Object Oriented
    MW is a set of C++ base classes
    Users write a few virtual functions
› Fault Tolerant
    Handles workers joining/leaving
    Handles suspended/resumed workers
    Checkpointing
5
MW Framework
[Diagram: layers, top to bottom]
› Application
› MW Layer
› Resource Management and Communication Layer (e.g. Condor, PVM, Globus, ...)
6
Why use MW?
› Handles the communication layer
› Handles resource management
    Especially useful in an opportunistic environment like Condor
› Same application code can run on various resource managers
7
MW Layer
› Three main entities
    MWDriver: corresponds to the (tyrant) master
    MWWorker: corresponds to the (exploited) worker
    MWTask: the work itself!
8
MWDriver: The Master
› Setup
› Manages workers joining/leaving
› Deals with HostSuspend/HostDelete
› Maintains lists of tasks: ToDo, Done, Running
› Acts on completed tasks
› Checkpointing
9
MWWorker: The Worker
› Executes the given task
    Unpacks the task received from the master
    Executes the task
    Packs the results and sends them back to the master
10
MWTask: The Work
› Definition of a unit of work
› Holds the work to be done and the results
› Follows the path Master-Worker-Master
11
[Diagram: Master and Workers]
› Master: Global Data
› To Do: T1 T2 T3 ...
› Running: (none)
› Done: (none)
› Workers: (none)
12
[Diagram: Master and Workers]
› Master: Global Data
› To Do: T2 T3 T4 ...
› Running: T1
› Done: (none)
› Workers: W1 (Wid 1)
13
[Diagram: Master and Workers]
› Master: Global Data
› To Do: T5 T6 T7 ...
› Running: T1 T2 T3 T4
› Done: (none)
› Workers: W1 W2 W3 W4 (Wid 1 to Wid 4)
14
[Diagram: Master and Workers]
› Master: Global Data
› To Do: T6 T7 T8 ...
› Running: T1 T2 T5 T4
› Done: T3
› Workers: W1 W2 W3 W4 (Wid 1 to Wid 4)
15
[Diagram: Master and Workers]
› Master: Global Data
› To Do: T2 T6 T7 ... (T2 is back at the head of the list)
› Running: T1 T5 T4
› Done: T3
› Workers: W1 W2 W3 W4 (Wid 1 to Wid 4)
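The diagram sequence on slides 11 to 15 can be read as tasks moving between the three lists the master maintains. The sketch below replays that sequence under stated assumptions: `TaskLists` and its methods are hypothetical names, not MW code, and the reading that T2 returns to the To Do list because its worker went away is an interpretation of the last diagram.

```cpp
#include <cassert>
#include <deque>
#include <map>
#include <vector>

// Hypothetical stand-in for the master's bookkeeping; not taken from MW.
struct TaskLists {
    std::deque<int> todo;        // task ids waiting for a worker
    std::map<int, int> running;  // worker id -> task id
    std::vector<int> done;       // completed task ids

    // Hand the task at the head of ToDo to an idle worker.
    bool assign(int worker) {
        if (todo.empty() || running.count(worker)) return false;
        running[worker] = todo.front();
        todo.pop_front();
        return true;
    }

    // The worker returned its result: Running -> Done.
    void complete(int worker) {
        auto it = running.find(worker);
        if (it == running.end()) return;
        done.push_back(it->second);
        running.erase(it);
    }

    // The worker went away: its task goes back to the head of ToDo.
    void worker_lost(int worker) {
        auto it = running.find(worker);
        if (it == running.end()) return;
        todo.push_front(it->second);
        running.erase(it);
    }
};

// Replays the states shown on slides 11 to 15 with tasks T1..T8.
TaskLists replay_slides() {
    TaskLists m;
    for (int t = 1; t <= 8; ++t) m.todo.push_back(t);
    for (int w = 1; w <= 4; ++w) m.assign(w);  // T1..T4 running
    m.complete(3);                             // T3 moves to Done
    m.assign(3);                               // worker 3 picks up T5
    m.worker_lost(2);                          // T2 is requeued first
    return m;
}
```

Requeueing at the head rather than the tail is a design choice made here to match the last diagram, where T2 appears before T6.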
16
Intelligent Scheduling of Tasks
› Some tasks may be harder
› Some machines may be faster
› Task ordering based on a user-defined MWKey
› Machine ordering based on resource manager information, e.g. Condor ClassAds
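One way to picture the two orderings is to sort tasks by a user-supplied key and machines by a speed figure, then pair them off. This is a minimal sketch, not MW's scheduler: `Task`, `Machine`, the `key` and `speed` fields, and `schedule` are all hypothetical, and the speed number stands in for whatever the resource manager reports.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical records: `key` plays the role of a user-defined task key,
// `speed` the role of a machine rating supplied by the resource manager.
struct Task { int id; double key; };
struct Machine { std::string name; double speed; };

// Pair the hardest tasks with the fastest machines.
std::vector<std::pair<int, std::string>> schedule(std::vector<Task> tasks,
                                                  std::vector<Machine> machines) {
    // Larger key = harder task, so sort descending.
    std::stable_sort(tasks.begin(), tasks.end(),
                     [](const Task& a, const Task& b) { return a.key > b.key; });
    // Larger speed = faster machine, so sort descending.
    std::stable_sort(machines.begin(), machines.end(),
                     [](const Machine& a, const Machine& b) { return a.speed > b.speed; });

    std::vector<std::pair<int, std::string>> out;
    std::size_t n = std::min(tasks.size(), machines.size());
    for (std::size_t i = 0; i < n; ++i)
        out.emplace_back(tasks[i].id, machines[i].name);
    return out;  // tasks beyond n stay unassigned until a machine frees up
}
```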
17
Checkpointing
› Saves master state in case of failure
› Automatic restart from the checkpoint file in case of master failure
› User-controlled checkpoint frequency
› Users need to implement two additional functions to take advantage of checkpointing
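The slide does not name the two user functions, so the sketch below uses hypothetical `write_state` and `read_state` to show the general shape: one function serializes the master's state, the other restores it on restart. The `MasterState` fields and the text format are assumptions, and a string stream stands in for the checkpoint file.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <vector>

// Hypothetical master state; field names are illustrative only.
struct MasterState {
    std::vector<int> todo;     // task ids still to be run
    std::vector<int> done;     // task ids already completed
    double best_so_far = 0.0;  // application-specific summary value

    // Serialize the state as whitespace-separated text.
    void write_state(std::ostream& out) const {
        out << best_so_far << '\n' << todo.size();
        for (int t : todo) out << ' ' << t;
        out << '\n' << done.size();
        for (int t : done) out << ' ' << t;
        out << '\n';
    }

    // Restore the state; returns false if the checkpoint is truncated.
    bool read_state(std::istream& in) {
        std::size_t n = 0;
        if (!(in >> best_so_far >> n)) return false;
        todo.assign(n, 0);
        for (int& t : todo)
            if (!(in >> t)) return false;
        if (!(in >> n)) return false;
        done.assign(n, 0);
        for (int& t : done)
            if (!(in >> t)) return false;
        return true;
    }
};

// Write a checkpoint and read it back, as a restart would.
MasterState round_trip(const MasterState& s) {
    std::stringstream file;  // stands in for the checkpoint file
    s.write_state(file);
    MasterState restored;
    restored.read_state(file);
    return restored;
}
```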
18
How can you use MW?
› MWDriver's virtual functions
    get_user_info() - processes arguments and does basic setup
    setup_initial_tasks() - fills in the ToDo list
    pack_worker_init_data() - packs init data for the worker
    act_on_completed_tasks() - called every time a task completes. Optional.
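To show how the four functions fit together, here is a toy driver that sums squares. The method names come from the slide, but everything else is an assumption: the `Driver` base class, its `run` loop, the `Task` struct, and every signature are simplified stand-ins, not MW's actual declarations. The worker is replaced by a local callable.

```cpp
#include <cassert>
#include <cstdlib>
#include <functional>
#include <vector>

struct Task { int input; int result; };  // hypothetical unit of work

// Simplified stand-in for a driver base class. Method names follow the
// slide; the signatures are assumptions made for this sketch.
class Driver {
public:
    virtual ~Driver() = default;
    virtual void get_user_info(int argc, char** argv) = 0;
    virtual void setup_initial_tasks(std::vector<Task>& todo) = 0;
    virtual void pack_worker_init_data(std::vector<int>& buf) = 0;
    virtual void act_on_completed_tasks(const Task&) {}  // optional hook

    // Toy run loop: `worker` stands in for remote execution.
    void run(int argc, char** argv, const std::function<int(int)>& worker) {
        get_user_info(argc, argv);
        std::vector<Task> todo;
        setup_initial_tasks(todo);
        std::vector<int> init;
        pack_worker_init_data(init);  // would be sent to each new worker
        for (Task& t : todo) {
            t.result = worker(t.input);
            act_on_completed_tasks(t);
        }
    }
};

// Application code: only the virtual functions are written by the user.
class SumSquaresDriver : public Driver {
public:
    int n = 0;
    long total = 0;
    void get_user_info(int argc, char** argv) override {
        n = (argc > 1) ? std::atoi(argv[1]) : 3;  // default of 3 is arbitrary
    }
    void setup_initial_tasks(std::vector<Task>& todo) override {
        for (int i = 1; i <= n; ++i) todo.push_back({i, 0});
    }
    void pack_worker_init_data(std::vector<int>&) override {}  // nothing to send
    void act_on_completed_tasks(const Task& t) override { total += t.result; }
};
```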
19
Using MW (cont.)
› MWWorker
    unpack_init_data() - unpacks the init data sent by the master
    execute_task() - executes a task and fills in the results
› MWTask
    pack_work(), unpack_work()
    pack_results(), unpack_results()
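The worker and task functions above can be sketched as one Master-Worker-Master trip. The method names follow the slide; the `Buffer` type, the `SumTask` and `SumWorker` classes, and all signatures are hypothetical and much simpler than a real message layer.

```cpp
#include <cassert>
#include <vector>

using Buffer = std::vector<long>;  // hypothetical message buffer

// A task that sums the integers in [lo, hi].
struct SumTask {
    long lo = 0, hi = 0;  // the work
    long sum = 0;         // the result

    void pack_work(Buffer& b) const { b.push_back(lo); b.push_back(hi); }
    void unpack_work(const Buffer& b) { lo = b.at(0); hi = b.at(1); }
    void pack_results(Buffer& b) const { b.push_back(sum); }
    void unpack_results(const Buffer& b) { sum = b.at(0); }
};

struct SumWorker {
    long scale = 1;  // init data sent once by the master

    void unpack_init_data(const Buffer& b) { scale = b.at(0); }
    void execute_task(SumTask& t) const {
        t.sum = 0;
        for (long i = t.lo; i <= t.hi; ++i) t.sum += scale * i;
    }
};

// One full trip: master packs work, worker executes, master unpacks results.
long round_trip(long lo, long hi, long scale) {
    SumTask master_copy;
    master_copy.lo = lo;
    master_copy.hi = hi;
    Buffer to_worker;
    master_copy.pack_work(to_worker);

    SumWorker worker;
    worker.unpack_init_data(Buffer{scale});
    SumTask worker_copy;
    worker_copy.unpack_work(to_worker);
    worker.execute_task(worker_copy);
    Buffer to_master;
    worker_copy.pack_results(to_master);

    master_copy.unpack_results(to_master);
    return master_copy.sum;
}
```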
20
Resource Management and Communication Layer
› Presently two implementations exist
    MW-PVM
        Communication: PVM
        Resource manager: Condor-PVM
    MW-File
        Communication: files
        Resource manager: Condor
21
MW-Independent
› Single-process version: master and a single worker
› Sends and receives become memcpy
› No changes in source!
› Why? Faster debugging
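The idea that sends and receives become memcpy can be illustrated with an in-process channel. This is a sketch under assumptions, not MW-Independent's code: `LocalChannel` and `square_via_channel` are hypothetical names, and the "worker" is just a few lines run in the same process.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical in-process channel: send and recv are plain memory copies.
class LocalChannel {
    std::vector<unsigned char> bytes_;
    std::size_t read_pos_ = 0;

public:
    void send(const void* data, std::size_t n) {
        if (n == 0) return;
        std::size_t old = bytes_.size();
        bytes_.resize(old + n);
        std::memcpy(bytes_.data() + old, data, n);
    }

    // Returns false if fewer than n unread bytes are available.
    bool recv(void* out, std::size_t n) {
        if (n > bytes_.size() - read_pos_) return false;
        if (n == 0) return true;
        std::memcpy(out, bytes_.data() + read_pos_, n);
        read_pos_ += n;
        return true;
    }
};

// Master and worker in one process, talking through two channels.
double square_via_channel(double x) {
    LocalChannel to_worker, to_master;

    to_worker.send(&x, sizeof x);        // master sends the work
    double received = 0;
    to_worker.recv(&received, sizeof received);  // worker receives it
    double r = received * received;      // worker executes the task
    to_master.send(&r, sizeof r);        // worker sends the result

    double result = 0;
    to_master.recv(&result, sizeof result);      // master receives it
    return result;
}
```

Because both sides use the same send and receive calls as a distributed run would, the application source does not change, which is what makes a debugger session on a single process practical.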
22
Data Management
› Exploits available memory rather than only CPU cycles
› Aims to reduce access to the storage site by caching data on the network
› Aimed at applications involving extremely large data sets
23
Data Manager
› Partitions large data sets into chunks
› Workers work on chunks allocated by the manager
[Diagram: Data Manager, workers W1 W2 W3, Storage Device, Memory]
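A minimal sketch of the partition-and-allocate idea follows. The slide does not say how chunks are sized or assigned, so the fixed chunk size and the round-robin allocation here are assumptions, and `Chunk`, `partition`, and `allocate` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open range [begin, end) of record indices in the data set.
struct Chunk { std::size_t begin; std::size_t end; };

// Split `total` records into chunks of at most `chunk_size` records.
std::vector<Chunk> partition(std::size_t total, std::size_t chunk_size) {
    std::vector<Chunk> chunks;
    if (chunk_size == 0) return chunks;
    for (std::size_t b = 0; b < total; b += chunk_size)
        chunks.push_back({b, std::min(total, b + chunk_size)});
    return chunks;
}

// Round-robin allocation of chunks to workers (an assumed policy).
std::vector<std::vector<Chunk>> allocate(const std::vector<Chunk>& chunks,
                                         std::size_t workers) {
    std::vector<std::vector<Chunk>> per_worker(workers);
    if (workers == 0) return per_worker;
    for (std::size_t i = 0; i < chunks.size(); ++i)
        per_worker[i % workers].push_back(chunks[i]);
    return per_worker;
}
```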
24
World Record Spree!!
› Stochastic Linear Programming
    "Record-breaking" problem of over 8.5M rows and 35M columns solved
› Quadratic Assignment Problem
    nug27 in a little more than two days!
› Jeff will say more in his talk
25
Extensions
› More resource manager ports
    Globus
    A thread-based version for SMPs
26
Scalability Issues
› Scales linearly with respect to the number of workers
› OK for 200 workers, but what about 1000?
› Solution: hierarchical MW?
27
More about MW
› http://www.cs.wisc.edu/condor/mw
› Demos of MW in action using the Condor pool tomorrow in Room 3397