Peter Couvares
Computer Sciences Department
University of Wisconsin-Madison

Metronome and The NMI Lab: This subtitle included solely to steal the “longest title” award from Ewa, who thought she won it this morning with “Pegasus and DAGMan: From Concept to Execution Mapping Scientific Workflows onto the National Cyberinfrastructure”

Decision Time
› Past
    Quick Review: why, what, who
› Present
    Current status, new this year
› Future
    Future plans, new next year

Why: The Problem
› Good distributed computing (“grid”) software is…
    badly needed
    hard to find
    hard to build and test

The Fix (Part of it, anyway)
› Good build/test cycle
› To be good, build/test process must be…
    frequent
    reliable
    automatic
    repeatable

The (Next) Problem
› Building and testing distributed computing software requires…
    Distributed resources
        Not always in-house, not always dedicated to builds
            I.e., shared, scheduled resources
        Unless you have a spare Blue Gene lying around… and an old Alpha running RedHat 7.2… and an HPUX 11 box… and an Itanium running Scientific Linux 3 (CERN-flavored)… and…
    Distributed testbeds, tests
        Not: “the grid works on my machine… ship it!”

Grid Build and Test
› Building and testing distributed computing software brings distributed challenges…
    Complex workflows, cross-site/project/user scheduling priorities, data management, fault-tolerance, failure recovery
    A lot like “real” distributed computing
    Tinderbox or the latest Web 2.0 build system doesn’t cut it
› Deep, integrated software stacks
    Distributed providers

How We Do It
› Use proven grid software to build and test new grid software
› “Condor works, let’s use Condor”
› Metronome is our second-generation build/test framework built on top of Condor, DAGMan, and other distributed computing technologies
› NSF-funded

Metronome Principles
› Tool-independent
› Lightweight
› Encourage explicit, well-controlled build/test environments
› Central results repository
› Fault-tolerance
› Support platform-neutral and platform-specific tasks
› Build/test separation

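To make “platform-neutral and platform-specific tasks” and “build/test separation” concrete, here is a minimal sketch of how one platform-neutral step followed by per-platform build and test jobs could be written as a DAGMan-style workflow. It is only an illustration, not the DAG Metronome actually generates: the `JOB` and `PARENT … CHILD` statements are DAGMan input syntax, but the function name, job names, submit-file names, and platform names are all hypothetical.

```python
from typing import List


def make_dag(platforms: List[str]) -> List[str]:
    """Return DAGMan-style lines for a toy build/test workflow.

    One platform-neutral 'fetch' job runs first; each platform then gets
    a build job and a separate test job that depends on that build.
    All job and submit-file names are hypothetical.
    """
    lines = ["JOB fetch fetch.sub"]  # platform-neutral task
    for p in platforms:
        # Platform-specific tasks, with build and test kept separate.
        lines.append(f"JOB build_{p} build_{p}.sub")
        lines.append(f"JOB test_{p} test_{p}.sub")
        # Build waits for the fetch; test waits for its own build.
        lines.append(f"PARENT fetch CHILD build_{p}")
        lines.append(f"PARENT build_{p} CHILD test_{p}")
    return lines


if __name__ == "__main__":
    # Hypothetical platform names, for illustration only.
    print("\n".join(make_dag(["linux_a", "hpux_b"])))
```

Because each test is its own node, a failed test on one platform does not force a rebuild there and does not block builds or tests on the other platforms.
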
Metronome
[Architecture diagram. Labels: INPUT, OUTPUT, Customer Source Code, Customer Build/Test Scripts, Spec File, NMI Build & Test Software, DAGMan, DAG, Condor Queue, build/test jobs, Distributed Build/Test Pool, results, MySQL Results DB, Web Portal, Finished Binaries]

NMI Lab
› Dedicated, heterogeneous distributed computing facility
    Opposite extreme from typical “cluster” -- instead of thousands of identical CPUs, we have a handful of CPUs each for 50+ platforms.
    Much harder to manage! You try finding a monitoring tool that works on 50 platforms!
› Carefully-controlled resources
    No mystery meat

The Team
› Subset of the Condor Team
    Becky Gietzel, master of all things NMI
    Todd Miller, new guy on the block
    Andy Pavlo, part-timer, short-timer
    Ken Hahn, sysadmin to the stars
    Me

Dogfood and Hats
› Eating our own dogfood…
    Condor builds failed last weekend (true!)
    Condor developers complained to NMI Lab (“your build system failed… fix it!”)
    NMI Lab discovered Condor bug (“hmm…”)
    NMI Lab complained to Condor developers (“your software failed… fix it!”)
› Feel the love!

The Past Year: What We Did on Our Summer Vacation

New Name!
› Before: NMI Build & Test System, NMI Build & Test Software, NMI Build & Test Framework, NMI Software, NMI Build & Test Lab, UW-Madison Build & Test Lab, Build & Test Lab at UW-Madison
› After: Metronome + the NMI Lab
› Why?
    Old names were a mouthful
    Clear separation between the software framework (Metronome) and the facility (the NMI Lab)

Real Work
› Extremely Productive Collaborations
    TeraGrid: production Metronome deployment using dynamically provisioned resources
    ETICS, OMII: building higher-level services to generate and manage build/test jobs across an international federation of Metronome deployments
› Extremely Productive Users
    Condor, TeraGrid, Open Science Grid / VDT, Globus, NCSA (MyProxy), SDSC (SRB), LIGO, many others in this room…

New Metronome Capabilities
› “Productization”, customization for other sites
› Parallel testing
    Enables dynamic, co-scheduled, distributed testbeds!
› Automatic cross-site job migration
    Run your own local Metronome pool with access to ours for exotic platforms
› Many smaller features and extensions for production users -- users drive development
› More bugs fixed than introduced!

New NMI Lab Capabilities
› More platforms
    “always with the platforms…”
    new Itanium platforms, NLOTW (New Linux of the Week), additional vendor Unix machines, etc.
    Now over 50 (!) platforms
› Improved Lab Management
    No, not me… better design and automation of systems & their administration

Future

The Plan: Metronome
› “Support, maintain, enhance”
    VM--I mean slot--no wait, I mean VM support
    Enhanced parallel testing support
    Custom testbed environments (network, etc.)
    Dynamic deployments (glide-in)
    Advanced scheduling policies
    Scalability testing enhancements
    Better docs/installation/management

The Plan: NMI Lab
› “Support, maintain, enhance”
    More platforms, always with the platforms
    More capacity
    VM servers for…
        Root-level testing
        On-demand platforms
    Federation with other Metronome labs
    Better support, smoother management, less downtime
    New sysadmin starting in June: take a bow, Ross!

You
› Want to use it?
› Metronome
› The NMI Lab

Feedback
› When we started, the state of the art was unimpressive (almost nonexistent)… we had to build our own
› More build tools now exist -- if you know & like one of them, what do you like about it?
› We’d like to better understand what we do well, what we don’t, and how we can integrate with other systems you find useful…