SC’07 Demo Draft
VGrADS Team
June 2007

Two vgES mechanisms to support FTR
- Broadcast / overprovision
- Restart / migration
[Diagram: each mechanism shows vgLaunch acting on a LooseBagOf(cluster) made up of ClusterOf elements]

FTR Mode - Step 1: Find
- vgdl request:

    vgdl = LooseBagOf (cluster) [5] {
        cluster = ClusterOf(node) [16] {
            node = [WRF == true]
        }
    }

[Diagram: Workflow Execution Manager, FTR and vgES; vgES returns a vgrid, a LooseBagOf(cluster) made up of ClusterOf elements]

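A minimal sketch of how a client might assemble the vgdl request shown on this slide. The helper `build_vgdl` and its parameter names are hypothetical, not part of vgES, and the sketch assumes the bracketed numbers are counts (5 clusters, 16 nodes per cluster).

```python
def build_vgdl(num_clusters: int, nodes_per_cluster: int, node_constraint: str) -> str:
    """Return a vgdl request string for a loose bag of clusters.

    Hypothetical helper: it only formats text in the shape used on the
    slide and does not talk to vgES.
    """
    return (
        f"vgdl = LooseBagOf (cluster) [{num_clusters}] "
        f"{{ cluster = ClusterOf(node) [{nodes_per_cluster}] "
        f"{{ node = [{node_constraint}] }} }}"
    )


# Reproduce the request from the slide.
spec = build_vgdl(5, 16, "WRF == true")
print(spec)
```

Keeping the request as a generated string makes it easy to vary the cluster count or node constraint when testing FTR against different vgrids.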
Find: Input Interfaces to FTR
- vgrid annotated with the following for each VC (VC: virtual cluster)
  - BQP
  - NWS
  - MDS
  - Reliability (if available)
- Mapping from virtual to real cluster for the performance model
[Diagram: vgrid LooseBagOf(cluster) LB0 containing VC0, VC1, VC2, VC3, VC4 (each a ClusterOf)]

Step 2: FTR decision + Bind
- vgrid LooseBagOf(cluster) LB0 + Annotations
- FTR: Bind on VC1, VC2 and VC3
[Diagram: vgrid LooseBagOf(cluster) LB0 containing VC0, VC1, VC2, VC3, VC4 (each a ClusterOf)]

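The slides do not say how FTR picks the VCs to bind. As an illustration only, the sketch below assumes each VC carries a reliability annotation (a success probability), assumes VC failures are independent, and picks the most reliable VCs until a target success probability is reached. The function name, the rule and the numbers are all hypothetical.

```python
from typing import Dict, List


def choose_vcs(reliability: Dict[str, float], target: float) -> List[str]:
    """Pick VCs, most reliable first, until the target is reached.

    Assumes VC failures are independent, so the chance that every chosen
    copy fails is the product of the individual failure probabilities.
    Returns all VCs if even the full set cannot reach the target.
    """
    chosen: List[str] = []
    p_all_fail = 1.0
    ranked = sorted(reliability.items(), key=lambda kv: kv[1], reverse=True)
    for vc, p_ok in ranked:
        chosen.append(vc)
        p_all_fail *= 1.0 - p_ok
        if 1.0 - p_all_fail >= target:
            break
    return chosen


# Hypothetical reliability annotations; not values from the demo.
annotations = {"VC0": 0.5, "VC1": 0.9, "VC2": 0.8, "VC3": 0.7, "VC4": 0.4}
selected = choose_vcs(annotations, target=0.99)
print(selected)  # ['VC1', 'VC2', 'VC3']
```

With these made-up annotations the rule selects three of the five VCs, which is the kind of subset the Bind step then acts on.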
Step 3: Execution in Broadcast Mode
- FTR: vgLaunch on vgrid [broadcast on VC1, VC2 and VC3]
- vgES operations: run, status, cancel
- Mechanism: Broadcast / overprovision
[Diagram: vgrid LooseBagOf(cluster) LB0 containing VC0, VC1, VC2, VC3, VC4 (each a ClusterOf)]

Step 3: Execution in Restart Mode
- FTR: vgLaunch on vgrid [run on VC1; restart on VC2 on failure]
- Mechanism: Restart
[Diagram: vgrid LooseBagOf(cluster) LB0 containing VC0, VC1, VC2, VC3, VC4 (each a ClusterOf)]

Step 3: Execution in Migration Mode (Future)
- FTR: vgLaunch on vgrid [migration path - VC1 to VC2 to VC3]
- Mechanism: Migration
[Diagram: vgrid LooseBagOf(cluster) LB0 containing VC0, VC1, VC2, VC3, VC4 (each a ClusterOf)]

Step 4: Monitor Execution
Broadcast case
- run and monitor status for each copy of the application
- cancel remaining copies on success
- application fails if all copies fail
Restart case
- run and monitor status for one copy
- call back FTR with new vgrid (or pruned vgrid)
- FTR decides another target VC and calls vgLaunch
- repeat

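The two monitoring cases above can be sketched as follows. This is an illustration under stated assumptions, not the vgES or FTR interface: `FakeJob`, the status strings, and the `choose` and `launch` callables are hypothetical stand-ins for a launched copy, its status query, the FTR decision and vgLaunch.

```python
from typing import Callable, Dict, List, Optional


class FakeJob:
    """Hypothetical stand-in for one launched copy with scripted statuses."""

    def __init__(self, states: List[str]) -> None:
        self._states = list(states)
        self.cancelled = False

    def status(self) -> str:
        # Return the next scripted state; repeat the last one when exhausted.
        if len(self._states) > 1:
            return self._states.pop(0)
        return self._states[0]

    def cancel(self) -> None:
        self.cancelled = True


def monitor_broadcast(jobs: Dict[str, FakeJob]) -> Optional[str]:
    """Broadcast case: return the VC whose copy succeeds first, or None."""
    pending = dict(jobs)
    while pending:
        # A real monitor would sleep between polling rounds.
        for vc, job in list(pending.items()):
            state = job.status()
            if state == "done":
                # Cancel the remaining copies on success.
                for other_vc, other in pending.items():
                    if other_vc != vc:
                        other.cancel()
                return vc
            if state == "failed":
                del pending[vc]
    return None  # the application fails only if all copies fail


def run_with_restart(
    vgrid: List[str],
    choose: Callable[[List[str]], str],
    launch: Callable[[str], bool],
) -> Optional[str]:
    """Restart case: try one VC at a time, pruning the vgrid on failure."""
    remaining = list(vgrid)
    while remaining:
        vc = choose(remaining)  # FTR decides the next target VC
        if launch(vc):          # stand-in for vgLaunch plus monitoring
            return vc
        remaining.remove(vc)    # call back FTR with the pruned vgrid
    return None


jobs = {
    "VC1": FakeJob(["running", "failed"]),
    "VC2": FakeJob(["running", "running", "done"]),
    "VC3": FakeJob(["running"]),
}
winner = monitor_broadcast(jobs)
restarted_on = run_with_restart(
    ["VC1", "VC2", "VC3"],
    choose=lambda vcs: vcs[0],
    launch=lambda vc: vc == "VC2",
)
print(winner, restarted_on)
```

In the scripted run, the broadcast copy on VC2 finishes first and the copy on VC3 is cancelled; in the restart run, VC1 fails and the job is relaunched on VC2.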
SC’07 Demo Flow
Components
- Resource Broker, Performance Model, Batch Queue Prediction, Scheduler, Mapper
- Execution System, Virtual Grid, GT4 GRAM, PBS, Globus Gateway, FTR
- Constantly collecting data over time
Planning
- DAG + Constraint: "Here is the workflow and constraints + pointer to performance model. Give me a mapping"
- (Reserved) Query the performance model for task’s resource requirements
- Find me two slots (vgFind)
- Return slots above threshold
- Bind Resources (vgBind)
- Use performance model and map the tasks to the slots. If deadline can’t be met, return. (Reserved)
- Return mapping (Annotated DAG)
Execution
- If not reserved resource, ask: is it time to submit?
- If reserved, submit PBS glide-in at slot start time; else submit when BQP suggests (Reserved)
- Normal Mode: vgLaunch, Run Job (Slot, PBS, Globus Gateway)
- FTR Mode: FTR (vgFind), vg + annotations, Run Job** (vgBind), vgLaunch**

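The submission rule in the flow (submit at slot start time when reserved, otherwise when BQP suggests) can be sketched as below. The `Slot` type, its fields and the time unit are assumptions made for illustration; the real resource broker interface is not shown in the slides.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    """Hypothetical description of a resource slot."""

    reserved: bool
    start_time: float  # illustrative unit: seconds since some epoch


def submit_time(slot: Slot, bqp_suggested_time: float) -> float:
    """Return when the glide-in job should be submitted."""
    if slot.reserved:
        # Reserved resource: submit at the slot start time.
        return slot.start_time
    # Unreserved resource: follow the batch queue prediction.
    return bqp_suggested_time


reserved_at = submit_time(Slot(reserved=True, start_time=100.0), 40.0)
unreserved_at = submit_time(Slot(reserved=False, start_time=100.0), 40.0)
print(reserved_at, unreserved_at)  # 100.0 40.0
```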
Current Issues
Key decision point
- Ryan’s scheduler plans for entire workflow before execution of any step of the workflow
- FTR operates dynamically as every workflow step is executed
Requirement
- Ability to execute on subset of VCs in a vgrid
Comment
- Can’t determine “redundancy” without knowing the available vgrid and annotations

Demo and Beyond
- Current demo scenario is “above the line” implementation
  - vgES provides the mechanisms and FTR provides the smarts
- Longer term
  - Pushing FTR “below the line” (inside vgES)
  - Integrating reliability aspects during workflow planning
  - High-reliability bag

Milestones
- Testbed
  - List of machines – accounts, keys, certs
  - vgES and LEAD application installation
  - Test old vgES + scheduler + RB software stack
- Nail down interfaces between vgES and FTR
- Implementation of vgES mechanisms
- Developer’s workshop
- New vgES (with multiple submissions) release
- vgES and FTR component test

Milestones
- Dummy FTR + new vgES test
  - test calls from FTR to vgES
- Test FTR working with vgrid
  - extracting vgrid and annotations via new interfaces
- FTR and vgES code freeze
- Test demo scenario for October workshop
  - scenario resulting in multiple submissions
- October all-hands workshop
- “glue” code freeze
- SC’07 demo

Milestones
- Dates, responsibilities and details TBD