APF Summary, October 2013, John Hover

Presentation transcript:

1 October 2013 APF Summary John Hover

2 October 2013 APF Summary Overview
AutoPyFactory (APF) is a job factory. It monitors the amount of work ready to perform in a source system, and it monitors the jobs already submitted to a destination system (running or pending). Each cycle (typically 6 minutes) APF:
– Calculates the proper number of jobs to submit.
– Submits them.
– Optionally posts info to an external monitoring system.
For the ATLAS Panda pilot-based Workload Management System:
– It queries for activated jobs, by WMS queue.
– Submits pilot wrappers to sites. One APF queue per Panda queue/site.
For virtual cluster management:
– Queries the local batch system for idle jobs.
– Submits VM requests to a Cloud platform. The VM connects back to the local cluster to run jobs.
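As a rough illustration of the cycle just described, here is a minimal Python sketch; the object names (wms_status, batch_status, sched_plugins, batch_submit, monitor) and their methods are assumptions for illustration, not APF's actual API.

def run_queue_cycle(wms_status, batch_status, sched_plugins, batch_submit, monitor=None):
    """One APF queue cycle: measure ready work, measure already-submitted jobs,
    decide how many new jobs to submit, submit them, and optionally report."""
    info = {}
    info.update(wms_status.get_info())    # e.g. {"activated": 1200}
    info.update(batch_status.get_info())  # e.g. {"pending": 50, "running": 300}
    n = 0
    for plugin in sched_plugins:          # chain of scheduling decisions
        n = plugin.calc_submit_num(n, info)
    if n > 0:
        batch_submit.submit(n)            # pilots (grid) or VM requests (cloud)
    if monitor is not None:
        monitor.report(info, submitted=n)
    return n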

3 October 2013 APF Summary General Design

4 October 2013 APF Summary Features
Scalable and Robust
– Single-process, multi-threaded daemon.
– Written in Python with proper error handling.
– Most core plugins built on HTCondor, a well-known batch/scheduling platform.
Extensible
– Plugin architecture allows future expansion. Already being used for purposes not originally foreseen.
Flexible
– Highly configurable. All components are designed to be mixed to serve future purposes.
Easily Deployable
– Cleanly packaged (RPM) and integrated as a typical Linux service. Upgradeable via simple package update. Conforms to systems administrator expectations.

5 11 Oct 2011 John Hover USATLAS Workshop Internals
Heavily multi-threaded
– Failures/timeouts in one section should not affect others. Each APF internal queue works independently.
– A single process simplifies global coordination.
Fully modular
– All functionality is handled by self-contained objects with defined interfaces for usage by other objects.
– The overall system is now a candidate for embedding, i.e. we run it as a daemon from init now, but it could also be instantiated as a web service with a web GUI.
Plugin architecture
– Allows easy configuration, extension, and customization.
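To make the threading model concrete, the following hypothetical sketch shows how independent per-queue threads confine failures; the QueueThread class and its methods are illustrative assumptions, not APF's real classes.

import logging
import threading

class QueueThread(threading.Thread):
    """Illustrative per-queue worker: each queue runs its own cycle loop,
    so an exception or timeout in one queue never stops the others."""

    def __init__(self, name, cycle_seconds=360):
        super().__init__(name=name, daemon=True)
        self.cycle_seconds = cycle_seconds
        self.stop_event = threading.Event()

    def run(self):
        while not self.stop_event.is_set():
            try:
                self.one_cycle()  # e.g. the run_queue_cycle() sketch shown earlier
            except Exception:
                # Log and continue; the failure stays confined to this queue.
                logging.exception("cycle failed in queue %s", self.name)
            self.stop_event.wait(self.cycle_seconds)

    def one_cycle(self):
        pass  # placeholder for the real per-queue work

# The single daemon process simply starts one thread per configured queue:
threads = [QueueThread(q) for q in ("SITE_A", "SITE_B")]
for t in threads:
    t.start()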

6 11 Oct 2011 John Hover USATLAS Workshop Plugins
WMS Status (Panda) Plugin
– Queries the WMS for its queue config and current state, e.g. how many jobs are activated? Running?
– E.g. Panda, LocalCondor
Batch Status Plugin
– Queries the local batch system (e.g. Condor-G or Condor) for submitted job state info (pending, failed, submitted, etc.)
Batch Submit Plugin
– Creates the submit file and issues the batch submit command(s).
– E.g. CondorGT2, CondorLocal, EC2
Sched Plugins
– Decide exactly how many pilots to submit each cycle.
– E.g. Fixed, Activated, NQueue, KeepNRunning, Scale, MinPerCycle
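The four plugin families above can be pictured as small interfaces along the following lines; these class and method names are assumptions chosen to match the earlier cycle sketch, not APF's actual plugin API.

from abc import ABC, abstractmethod

class WMSStatusPlugin(ABC):
    """Reports work available in the source system (e.g. Panda)."""
    @abstractmethod
    def get_info(self):
        """Return counts such as activated/running jobs for a WMS queue."""

class BatchStatusPlugin(ABC):
    """Reports the state of jobs already submitted to the destination."""
    @abstractmethod
    def get_info(self):
        """Return counts of pending/running/failed jobs for an APF queue."""

class BatchSubmitPlugin(ABC):
    """Creates the submit file and issues the submit command(s)."""
    @abstractmethod
    def submit(self, n):
        """Submit n pilots or VM requests."""

class SchedPlugin(ABC):
    """Decides how many pilots to submit this cycle."""
    @abstractmethod
    def calc_submit_num(self, n, info):
        """Refine the previous plugin's answer using the queue status info."""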

7 13 June 2013 S&C Modular algorithms
APF provides fine-grained scheduling plugins, used to calculate how many jobs to submit/retire each cycle. The output of each earlier plugin is fed into the next; the answer comes from the last. E.g.:
schedplugin = Ready, Scale, StatusTest, MaxPerCycle, MinPerCycle, MaxPending, MaxTorun, StatusOffline
sched.ready.offset = 100
sched.scale.factor = .25
sched.minpercycle.minimum = 0
sched.maxpercycle.maximum = 100
sched.maxpending.maximum = 25
sched.maxtorun.maximum = 250
Mix-and-matchable with any set of status/submit plugins (grid, local, EC2).
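A hypothetical sketch of how such a chain composes, using a few of the plugins named above; the exact semantics of the real plugins may differ, and these implementations are assumptions for illustration only.

class Ready:
    """Start from the number of activated jobs, less a configured offset."""
    def __init__(self, offset=100):
        self.offset = offset
    def calc_submit_num(self, n, info):
        return max(info["activated"] - self.offset, 0)

class Scale:
    """Scale the running answer by a configured factor."""
    def __init__(self, factor=0.25):
        self.factor = factor
    def calc_submit_num(self, n, info):
        return int(n * self.factor)

class MaxPerCycle:
    """Cap the number submitted in any single cycle."""
    def __init__(self, maximum=100):
        self.maximum = maximum
    def calc_submit_num(self, n, info):
        return min(n, self.maximum)

def decide(plugins, info):
    """Feed each plugin's output into the next; the last answer wins."""
    n = 0
    for p in plugins:
        n = p.calc_submit_num(n, info)
    return n

# With 1000 activated jobs: Ready -> 900, Scale(0.25) -> 225, MaxPerCycle(100) -> 100
print(decide([Ready(100), Scale(0.25), MaxPerCycle(100)], {"activated": 1000}))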

8 11 Oct 2011 John Hover USATLAS Workshop Self-Contained
Built-in web server for batch log export
– No more separate Apache setup.
– Allows other information to be exported for public view.
Built-in, in-process batch log cleanup
– Configurable by time and/or disk % usage.
Integrated Proxy Management
– Allows for multiple proxy types, with each queue specifying which to use.
– Allows specification of a list of certificates for fail-over. If one has expired, it generates a proxy with the next.
– Allows for use of a long-lived base vanilla proxy. No requirement for clear-text passwords on the system.
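A hypothetical sketch (not APF's proxy-management code) of the fail-over idea: walk an ordered list of certificates and generate a proxy from the first usable one. The helper name and the use of voms-proxy-init with these particular options are assumptions for illustration.

import subprocess

def generate_proxy_with_failover(cert_key_pairs, proxy_path):
    """Try each (cert, key) pair in order of preference; return on first success.
    Illustrative only: real proxy generation and validity checks depend on the
    grid middleware and VO configuration in use."""
    for cert, key in cert_key_pairs:
        result = subprocess.run(
            ["voms-proxy-init", "-cert", cert, "-key", key, "-out", proxy_path],
            capture_output=True,
        )
        if result.returncode == 0:
            return proxy_path  # this certificate worked; use its proxy
        # otherwise the certificate is likely expired/invalid; try the next one
    raise RuntimeError("no usable certificate found for proxy generation")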

9 11 Oct 2011 John Hover USATLAS Workshop Usable to provision Cloud-based clusters

10 13 Nov 2012 John Hover VM Lifecycle
Instantiation
– When we want to expand the resource, a VM is instantiated, for as long as there is sufficient work to keep it busy.
Association
– In order to manage the lifecycle, we track the association between a particular VM and a particular batch cluster machine. (I.e. we cannot tell which VMs are running jobs from the Cloud API.)
– This is done via an embedded DB (with Euca tools) or a ClassAd attribute (Condor-G).
Retirement
– When we want to contract the cluster, APF tells each batch slot on a VM to retire, i.e. finish its current job but accept no more.
Termination
– Once all batch slots on a VM are idle, the VM is terminated.
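A minimal sketch of this lifecycle as a state machine; the state names mirror the slide, but the class and transition logic are illustrative assumptions rather than APF's implementation.

from enum import Enum, auto

class VMState(Enum):
    INSTANTIATED = auto()   # VM requested from the cloud API
    ASSOCIATED = auto()     # matched to the batch worker it registered as
    RETIRING = auto()       # slots told to finish current jobs, accept no more
    TERMINATED = auto()     # all slots idle; VM destroyed via the cloud API

def next_state(state, slots_idle, want_contract):
    """Advance one VM through the lifecycle based on cluster demand."""
    if state is VMState.INSTANTIATED:
        return VMState.ASSOCIATED        # once we can map VM <-> batch node
    if state is VMState.ASSOCIATED and want_contract:
        return VMState.RETIRING          # tell its batch slots to retire
    if state is VMState.RETIRING and slots_idle:
        return VMState.TERMINATED        # safe to destroy the instance
    return state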

11 13 Nov 2012 John Hover Cloud interactions
APF uses a plugin architecture to use the Cloud APIs on EC2 and OpenStack.
– The Condor plugin supports both EC2 and OpenStack via Condor-G.
APF supports hierarchies and weighting.
– We can establish multiple Cloud resources in order of preference, and expand and contract preferentially, e.g.:
  Local OpenStack (free, local)
  Another ATLAS facility OpenStack (free, remote)
  Academic cloud at another institution (free, remote)
  Amazon EC2 via spot pricing (cheap, remote)
  Amazon EC2 via guaranteed instance (costly, remote)
– Weighting means supporting a scaling factor between the number of waiting jobs and the number of slots created, e.g. create 1 VM for every 10 Activated jobs waiting.
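To illustrate weighting (e.g. 1 VM per 10 activated jobs) and preference-ordered expansion, here is a hypothetical sketch; the tier names, capacities, and fill strategy are assumptions for illustration.

def vms_to_request(activated_jobs, weight=10):
    """Scaling factor between waiting jobs and VMs/slots: e.g. 1 VM per 10 jobs."""
    return activated_jobs // weight

def fill_by_preference(needed, tiers):
    """Fill preferred (cheaper/local) tiers first; spill the remainder to later ones.
    'tiers' is an ordered list of (name, max_vms) pairs."""
    plan = []
    for name, max_vms in tiers:
        take = min(needed, max_vms)
        plan.append((name, take))
        needed -= take
        if needed <= 0:
            break
    return plan

tiers = [("local_openstack", 20), ("remote_openstack", 30), ("ec2_spot", 1000)]
print(fill_by_preference(vms_to_request(activated_jobs=600), tiers))
# 600 jobs / weight 10 -> 60 VMs: 20 local, 30 remote, 10 on EC2 spot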

12 October 2013 APF Summary External Monitor Project: