Peter Couvares
Computer Sciences Department
University of Wisconsin-Madison

Metronome and The NMI Lab: This subtitle included solely to steal the "longest title" award from Ewa, who thought she won it this morning with "Pegasus and DAGMan: From Concept to Execution: Mapping Scientific Workflows onto the National Cyberinfrastructure"

Decision Time
› Past: quick review (why, what, who)
› Present: current status, new this year
› Future: future plans, new next year

Why: The Problem
› Good distributed computing ("grid") software is…
  - badly needed
  - hard to find
  - hard to build and test

The Fix (Part of It, Anyway)
› Good build/test cycle
› To be good, the build/test process must be…
  - frequent
  - reliable
  - automatic
  - repeatable

The (Next) Problem
› Building and testing distributed computing software requires…
  - Distributed resources
    - Not always in-house, not always dedicated to builds: i.e., shared, scheduled resources
    - Unless you have a spare Blue Gene lying around… and an old Alpha running RedHat 7.2… and an HPUX 11 box… and an Itanium running Scientific Linux 3 (CERN-flavored)… and…
  - Distributed testbeds and tests
    - Not: "the grid works on my machine… ship it!"

Grid Build and Test
› Building and testing distributed computing software brings distributed challenges…
  - Complex workflows, cross-site/project/user scheduling priorities, data management, fault-tolerance, failure recovery
  - A lot like "real" distributed computing
  - Tinderbox or the latest Web 2.0 build system doesn't cut it
› Deep, integrated software stacks
  - Distributed providers

How We Do It
› Use proven grid software to build and test new grid software
› "Condor works, let's use Condor"
› Metronome is our second-generation build/test framework, built on top of Condor, DAGMan, and other distributed computing technologies (see the DAG sketch below)
› NSF-funded
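
To make the "built on Condor and DAGMan" point concrete, here is a minimal DAG sketch of a build-then-test workflow. It uses standard DAGMan syntax, but the node and file names are invented for illustration; this is not the DAG Metronome actually generates.

    # build-and-test.dag: one build node fans out to per-platform test nodes.
    # DAGMan runs the test nodes only if the build node exits successfully.
    JOB  Build     build.sub
    JOB  TestLinux test-linux.sub
    JOB  TestSol   test-solaris.sub
    PARENT Build CHILD TestLinux TestSol

    # Retry flaky nodes before declaring the workflow failed.
    RETRY TestLinux 2
    RETRY TestSol   2

If a node still fails after its retries, DAGMan writes a rescue DAG so the run can resume where it left off, which is the kind of fault-tolerance the principles on the next slide depend on.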

Metronome Principles
› Tool-independent
› Lightweight
› Encourage explicit, well-controlled build/test environments
› Central results repository
› Fault-tolerance
› Support platform-neutral and platform-specific tasks
› Build/test separation

[Architecture diagram: customer source code, a spec file, and customer build/test scripts are the INPUT to the NMI Build & Test Software; it submits DAGMan DAGs of build/test jobs to the Condor queue for execution on the distributed build/test pool; DAG results flow back, and the OUTPUT (results in a MySQL results DB behind a web portal, plus finished binaries) is returned to the customer.]
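
The "spec file" in the diagram is the customer's declarative description of what to build and where. The sketch below is hypothetical: key = value declarations in the NMI style, but the specific key names (project, platforms, inputs, remote_task, notify) are assumptions, not verified Metronome syntax.

    # metronome.spec: hypothetical build/test declaration (key names assumed)
    project     = my-software
    component   = core
    platforms   = x86_rhas_3, sun4u_sol_5.9, ia64_sles_9
    inputs      = fetch-source.scp            # platform-neutral: fetch the source once
    remote_task = scripts/build_and_test.sh   # platform-specific: runs on each platform
    notify      = dev-list@example.edu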

NMI Lab
› Dedicated, heterogeneous distributed computing facility
  - Opposite extreme from a typical "cluster": instead of 1000s of identical CPUs, we have a handful of CPUs each for 50+ platforms (see the submit sketch below)
  - Much harder to manage! You try finding a monitoring tool that works on 50 platforms!
› Carefully-controlled resources
  - No mystery meat
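
In Condor terms, a job picks out one of those 50+ platforms with a ClassAd requirements expression. The submit description below uses standard Condor submit syntax; the script and file names are invented for illustration.

    # build.sub: steer a build job to one platform in a heterogeneous pool.
    universe     = vanilla
    executable   = build_and_test.sh
    # OpSys and Arch are standard machine ClassAd attributes; this
    # expression matches only Itanium Linux machines.
    requirements = (OpSys == "LINUX") && (Arch == "IA64")
    output       = build.out
    error        = build.err
    log          = build.log
    queue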

The Team
› Subset of the Condor Team:
  - Becky Gietzel, master of all things NMI
  - Todd Miller, new guy on the block
  - Andy Pavlo, part-timer, short-timer
  - Ken Hahn, sysadmin to the stars
  - Me

Dogfood and Hats
› Eating our own dogfood…
  - Condor builds failed last weekend (true!)
  - Condor developers complained to the NMI Lab ("your build system failed… fix it!")
  - NMI Lab discovered a Condor bug ("hmm…")
  - NMI Lab complained to the Condor developers ("your software failed… fix it!")
› Feel the love!

The Past Year: What We Did on Our Summer Vacation

New Name!
› Before: NMI Build & Test System, NMI Build & Test Software, NMI Build & Test Framework, NMI Software, NMI Build & Test Lab, UW-Madison Build & Test Lab, Build & Test Lab at UW-Madison
› After: Metronome + the NMI Lab
› Why?
  - Old names were a mouthful
  - Clear separation between the software framework (Metronome) and the facility (the NMI Lab)

Real Work
› Extremely Productive Collaborations
  - TeraGrid: production Metronome deployment using dynamically provisioned resources
  - ETICS, OMII: building higher-level services to generate and manage build/test jobs across an international federation of Metronome deployments
› Extremely Productive Users
  - Condor, TeraGrid, Open Science Grid / VDT, Globus, NCSA (MyProxy), SDSC (SRB), LIGO, many others in this room…

New Metronome Capabilities
› "Productization", customization for other sites
› Parallel testing
  - Enables dynamic, co-scheduled, distributed testbeds!
› Automatic cross-site job migration
  - Run your own local Metronome pool with access to ours for exotic platforms (see the config sketch below)
› Many smaller features and extensions for production users: users drive development
› More bugs fixed than introduced!
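
Cross-site migration of this flavor is what Condor calls flocking: jobs that cannot be matched locally are forwarded to another pool. Assuming Metronome rides on that mechanism (an inference from its Condor foundation, not something this talk states), it comes down to a pair of condor_config settings; the hostnames here are invented.

    # On the local site's submit machine: if no local machine can run a
    # job, let it flock to the remote pool's central manager.
    FLOCK_TO = cm.nmi-lab.example.edu

    # On the remote pool's central manager: accept flocked jobs from the
    # local site's submit machine.
    FLOCK_FROM = submit.local-site.example.edu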

New NMI Lab Capabilities
› More platforms ("always with the platforms…")
  - New Itanium platforms, NLOTW (New Linux of the Week), additional vendor Unix machines, etc.
  - Now over 50 (!) platforms
› Improved Lab management
  - No, not me… better design and automation of systems & their administration

Future

The Plan: Metronome
› "Support, maintain, enhance"
  - VM--I mean slot--no wait, I mean VM support
  - Enhanced parallel testing support
  - Custom testbed environments (network, etc.)
  - Dynamic deployments (glide-in)
  - Advanced scheduling policies
  - Scalability testing enhancements
  - Better docs/installation/management

The Plan: NMI Lab
› "Support, maintain, enhance"
  - More platforms (always with the platforms)
  - More capacity
  - VM servers for…
    - Root-level testing
    - On-demand platforms
  - Federation with other Metronome labs
  - Better support, smoother management, less downtime
    - New sysadmin starting in June: take a bow, Ross!

You
› Want to use it?
› Metronome
› The NMI Lab
›

Feedback
› When we started, the state of the art was unimpressive (almost non-existent)… we had to build our own
› More build tools now exist; if you know & like one of them, what do you like about it?
› We'd like to better understand what we do well, what we don't, and how we can integrate with other systems you find useful…