SC’07 Demo Draft VGrADS Team June 2007.



Two vgES mechanisms to support FTR
- Broadcast / overprovision
- Restart / migration
[Diagram: for each mechanism, vgLaunch targets a LooseBagOf(cluster) made up of ClusterOf virtual clusters.]

FTR Mode - Step 1: Find
[Diagram: Workflow Execution Manager, FTR and vgES; the vgrid returned by vgES is a LooseBagOf(cluster) of ClusterOf.]
vgdl = LooseBagOf(cluster) [5] {
    cluster = ClusterOf(node) [16] {
        node = [WRF == true]
    }
}
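As an illustration of how a client might assemble the request above, here is a minimal Python sketch. The helper name `loose_bag_of_clusters` is hypothetical and not part of vgES. Reading `[5]` as the number of clusters and `[16]` as the nodes per cluster is an assumption based on the slide's notation.

```python
def loose_bag_of_clusters(num_clusters: int, nodes_per_cluster: int,
                          node_constraint: str) -> str:
    """Render a vgDL request string in the shape shown on the slide.

    Hypothetical helper: it only formats text and does not talk to vgES.
    """
    return (
        f"LooseBagOf(cluster) [{num_clusters}] {{ "
        f"cluster = ClusterOf(node) [{nodes_per_cluster}] {{ "
        f"node = [{node_constraint}] }} }}"
    )


# Reproduces the request from the slide (assumed meaning of the bracketed numbers).
vgdl = loose_bag_of_clusters(5, 16, "WRF == true")
print(vgdl)
```

The resulting string would then be handed to whatever find call vgES exposes; that call is not shown here because its signature is not given in the slides.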

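Find: Input Interfaces to FTR
The vgrid is annotated with the following for each VC (virtual cluster):
- BQP (batch queue prediction)
- NWS (Network Weather Service)
- MDS (Globus Monitoring and Discovery Service)
- Reliability (if available)
Mapping from virtual to real cluster, for the performance model
[Diagram: vgrid LB0, a LooseBagOf(cluster) containing ClusterOf virtual clusters VC0, VC1, VC2, VC3 and VC4.]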

Step 2: FTR decision + Bind
- Input: vgrid LB0 (LooseBagOf(cluster)) + annotations
- FTR binds on VC1, VC2 and VC3
[Diagram: vgrid LB0 containing ClusterOf virtual clusters VC0, VC1, VC2, VC3 and VC4.]
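The slides do not say how FTR chooses its bind targets. The Python sketch below only illustrates the shape of the decision: annotated virtual clusters go in, a subset comes out. The `VirtualCluster` fields, the ranking rule and all numeric values are made up for illustration and are not FTR's actual policy or data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VirtualCluster:
    name: str
    real_cluster: str                    # virtual-to-real mapping (name is illustrative)
    bqp_wait_s: float                    # assumed form of the BQP annotation: predicted wait
    reliability: Optional[float] = None  # "if available" on the slide


def choose_targets(vgrid: List[VirtualCluster], copies: int) -> List[str]:
    """Placeholder policy: prefer higher reliability, then shorter predicted wait.

    Clusters with no reliability annotation are ranked as if reliability were 0.
    """
    def rank(vc: VirtualCluster):
        rel = vc.reliability if vc.reliability is not None else 0.0
        return (-rel, vc.bqp_wait_s)

    return [vc.name for vc in sorted(vgrid, key=rank)[:copies]]


# Invented annotations, chosen so the outcome matches the slide's example.
vgrid = [
    VirtualCluster("VC0", "siteA", 900.0, 0.50),
    VirtualCluster("VC1", "siteB", 60.0, 0.99),
    VirtualCluster("VC2", "siteC", 120.0, 0.95),
    VirtualCluster("VC3", "siteD", 300.0, 0.90),
    VirtualCluster("VC4", "siteE", 30.0, None),
]
chosen = choose_targets(vgrid, copies=3)
print(chosen)
```

With these invented values the selection is VC1, VC2 and VC3, the same subset the slide binds on.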

Step 3: Execution in Broadcast Mode
- vgLaunch on vgrid [broadcast on VC1, VC2 and VC3]
- Operations: run, status, cancel
- Mechanism: broadcast / overprovision
[Diagram: FTR calls vgLaunch through vgES on vgrid LB0, a LooseBagOf(cluster) containing ClusterOf virtual clusters VC0 to VC4.]

Step 3: Execution in Restart Mode
- vgLaunch on vgrid [run on VC1; restart on VC2 on failure]
- Mechanism: restart
[Diagram: FTR calls vgLaunch through vgES on vgrid LB0, a LooseBagOf(cluster) containing ClusterOf virtual clusters VC0 to VC4.]

Step 3: Execution in Migration Mode (Future)
- vgLaunch on vgrid [migration path - VC1 to VC2 to VC3]
- Mechanism: migration
[Diagram: FTR calls vgLaunch through vgES on vgrid LB0, a LooseBagOf(cluster) containing ClusterOf virtual clusters VC0 to VC4.]

Step 4: Monitor Execution
Broadcast case
- run and monitor status for each copy of application
- cancel remaining copies on success
- application fails if all copies fail
Restart case
- run and monitor status for one copy
- call-back FTR with new vgrid (or pruned vgrid)
- FTR decides another target vc and calls vgLaunch
- repeat
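The two monitoring cases can be sketched in Python as follows. This is a local simulation under stated assumptions, not the vgES interface: `run_copy` stands in for launching one copy and waiting for its final status, `pick_target` stands in for the FTR decision, and threads stand in for remote executions.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, List, Optional


def run_broadcast(targets: List[str],
                  run_copy: Callable[[str], bool]) -> Optional[str]:
    """Run one copy per target; return a target that succeeded, or None if all fail."""
    if not targets:
        return None
    with ThreadPoolExecutor(max_workers=len(targets)) as pool:
        futures = {pool.submit(run_copy, vc): vc for vc in targets}
        for done in as_completed(futures):
            if done.result():
                # Best effort: Future.cancel() only stops copies that have not
                # started yet. A real system would issue a remote cancel instead.
                for other in futures:
                    if other is not done:
                        other.cancel()
                return futures[done]
    return None  # application fails because every copy failed


def run_with_restart(vgrid: List[str],
                     pick_target: Callable[[List[str]], str],
                     run_copy: Callable[[str], bool]) -> Optional[str]:
    """Run one copy at a time; on failure, prune the failed VC and ask again."""
    remaining = list(vgrid)
    while remaining:
        vc = pick_target(remaining)   # FTR decides a target on the current vgrid
        if run_copy(vc):
            return vc
        remaining.remove(vc)          # call back with the pruned vgrid, then repeat
    return None


# Simulated outcome: only VC2 succeeds.
only_vc2 = lambda vc: vc == "VC2"
print(run_broadcast(["VC1", "VC2", "VC3"], only_vc2))
print(run_with_restart(["VC1", "VC2", "VC3"], lambda vcs: vcs[0], only_vc2))
```

Broadcast trades extra resource use for a lower chance of total failure, while restart uses one cluster at a time and pays for each failure with a rerun.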

SC’07 Demo Flow
[Diagram; only its labels survive in this transcript. Their order and connections are not recoverable.]
Components: Resource Broker, Performance Model, Batch Queue Prediction (BQP), Execution System, Virtual Grid, Scheduler, Mapper, FTR, GT4 GRAM, Globus Gateway, PBS, Slot
Phases: Planning, Execution
Labels:
- Constantly collecting data over time (appears twice)
- DAG + Constraint: "Here is the workflow and constraints + pointer to performance model. Give me a mapping"
- Query the performance model for task’s resource requirements
- Find me two slots (vgFind)
- Return slots above threshold
- Bind Resources (vgBind)
- Use performance model and map the tasks to the slots. If deadline can’t be met, return.
- Return mapping
- Annotated DAG
- If not reserved resource, ask: is it time to submit?
- If reserved, submit PBS glide-in at slot start time; else submit when BQP suggests
- Normal Mode: vgLaunch, Run Job
- FTR Mode: (vgFind), vg + annotations, (vgBind), vgLaunch**, Run Job**
- (Reserved) (appears three times)
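The submission rule in the flow (submit at slot start time if the resource is reserved, otherwise submit when BQP suggests) can be written as a small Python predicate. The `Slot` type and the boolean `bqp_says_submit` input are assumptions for illustration; the slides do not define how BQP reports its advice.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    start_time: float   # seconds since some epoch (illustrative unit)
    reserved: bool


def should_submit_now(slot: Slot, now: float, bqp_says_submit: bool) -> bool:
    """Decide whether to submit the glide-in job at time `now`."""
    if slot.reserved:
        # Reserved resource: wait for the slot's start time.
        return now >= slot.start_time
    # Unreserved resource: defer to the batch queue prediction.
    return bqp_says_submit


reserved_slot = Slot(start_time=1000.0, reserved=True)
open_slot = Slot(start_time=1000.0, reserved=False)
print(should_submit_now(reserved_slot, now=900.0, bqp_says_submit=True))
print(should_submit_now(open_slot, now=900.0, bqp_says_submit=True))
```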

Current Issues
Key decision point
- Ryan’s scheduler plans for entire workflow before execution of any step of the workflow
- FTR operates dynamically as every workflow step is executed
Requirement
- Ability to execute on subset of VCs in a vgrid
Comment
- Can’t determine “redundancy” without knowing the available vgrid and annotations

Demo and Beyond
Current demo scenario is “above the line” implementation
- vgES provides the mechanisms and FTR provides the smarts
Longer term
- Pushing FTR “below the line” (inside vgES)
- Integrating reliability aspects during workflow planning
- High-reliability bag

Milestones
- Testbed
- Nail down interfaces between vgES and FTR
- List of machines: accounts, keys, certs
- vgES and LEAD application installation
- Test old vgES + scheduler + RB software stack
- Implementation of vgES mechanisms
- Developer’s workshop
- New vgES (with multiple submissions) release
- vgES and FTR component test

Milestones
- Dummy FTR + new vgES test
- Test FTR working with vgrid: test calls from FTR to vgES
- Test FTR working with vgrid: extracting vgrid and annotations via new interfaces
- FTR and vgES code freeze
- Test demo scenario for October workshop: scenario resulting in multiple submissions
- October all-hands workshop
- “glue” code freeze
- SC’07 demo

Milestones
- Dates, responsibilities and details TBD