Keeping Your Software Ticking: Testing with Metronome and the NMI Lab

Presentation transcript:

Keeping Your Software Ticking: Testing with Metronome and the NMI Lab

Background: Why (In a Slide!)
Grid Software: Important to Science and Industry
Quality of Grid Software: Not So Much
Testing: Key to Quality
Testing Distributed Software: Hard
Testing Distributed Software Stacks: Harder
Distributed Software Testing Tools: Nonexistent (before)
We Needed Help; We Built Something to Help Ourselves and Our Friends; We Think It Can Help Others

Background: What (In a Slide!)
A Framework and Tool: Metronome
– Lightweight, built atop Condor, DAGMan, and other proven distributed computing tools
– Portable, open source
– Language/harness independent
– Assumes >1 user, >1 project, >1 environment needing resources at >1 site
– Encourages explicit, well-controlled build/test environments for reproducibility
– Central results repository
– Fault-tolerant
– Encourages build/test separation
A Facility: The NMI Lab
– 200+ cores, 50+ UW (Noah’s Ark; the Anti-Cluster)
– Built to use distributed resources at other sites, grids, etc.
– 200 users, dozens of registered projects (most of them “real”)
– 84k builds & tests managed by 1M Condor jobs, producing 6.5M tracked tasks in the DB
A Team
– Subset of Condor Team: Becky Gietzel, Todd Miller, Ross Oldenburg, myself. (More coming.)
A Community
– Working with TeraGrid, OSG, ETICS, and others towards a common intl. build/test infrastructure.
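To make the framework concrete, here is a minimal sketch of the kind of declarative submit (spec) file that drives a Metronome run. The keyword names (project, component, run_type, platforms, inputs, remote_task, notify) are recalled from NMI Lab documentation of this era and should be treated as illustrative rather than authoritative; the project name, file names, and platform list are hypothetical.

    # Hypothetical Metronome submit (spec) file for a nightly build
    project           = my-project          # registered project name (hypothetical)
    component         = my-component
    component_version = trunk
    description       = nightly build and smoke test of my-component
    run_type          = build               # builds and tests are separate run types
    platforms         = x86_64_rhap_5, x86_rhas_3, sun4u_sol_5.9
    inputs            = my-sources.scp, my-glue.scp   # fetch specs for sources and glue scripts
    remote_task       = glue/build.sh       # the customer build/test script run on each platform
    notify            = someone@example.edu

A spec like this fans out into per-platform jobs, which is what the DAG in the architecture slide is built from.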

Metronome Architecture (In a Slide!)
[Architecture diagram] Input: customer source code, customer build/test scripts, and a spec file. Metronome turns the spec into a DAGMan DAG of build/test jobs, which runs through the Condor queue on the distributed build/test pool; results flow back through DAGMan. Output: the MySQL results DB, web status pages, and finished binaries.
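The DAG in the middle of that pipeline is ordinary Condor DAGMan input. As a rough illustration (not Metronome’s literal generated output), a two-platform run whose tests depend on their builds might look like this; the node names and .sub submit files are hypothetical:

    # Illustrative DAGMan file: per-platform build jobs with dependent test jobs
    JOB  build_x86_64  build_x86_64.sub
    JOB  build_sol9    build_sol9.sub
    JOB  test_x86_64   test_x86_64.sub
    JOB  test_sol9     test_sol9.sub

    # each test node runs only after its build node succeeds
    PARENT build_x86_64 CHILD test_x86_64
    PARENT build_sol9   CHILD test_sol9

DAGMan retries failed nodes and keeps independent nodes running, which is a large part of the fault tolerance claimed on the next slide.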

Why Is This Architecture Powerful?
Fault tolerance, resource management.
Real scheduler, not a toy or afterthought.
Flexible workflow tools.
Nothing to deploy in advance on worker nodes except Condor – can harness “unprepared” resources (sketched below).
Advanced job migration capabilities – critical for the goal of a common build/test infrastructure across projects, sites, and countries.
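As a sketch of what “nothing to deploy in advance” means in practice, a Metronome task ultimately becomes something like the following vanilla-universe Condor job, where everything the build needs travels with the job. This is not Metronome’s literal generated submit file, and the script and archive names are hypothetical:

    # Illustrative Condor submit description: the job carries its own sources and
    # glue scripts, so the worker node only needs a running Condor daemon.
    universe                = vanilla
    executable              = glue/build.sh
    arguments               = --with-tests
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    transfer_input_files    = my-sources.tar.gz, glue
    requirements            = (OpSys == "LINUX") && (Arch == "X86_64")
    output                  = build.out
    error                   = build.err
    log                     = build.log
    queue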

Example: NMI Lab / ETICS Site Federation with Condor-C
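In Condor terms, Condor-C lets jobs queued at one site be forwarded to another site’s schedd for execution there. A hedged sketch of the submit-side syntax, with hypothetical host names standing in for the NMI Lab and ETICS endpoints:

    # Illustrative Condor-C submission: the grid universe forwards this job to a
    # remote schedd (and its pool) at a federated site.
    universe      = grid
    grid_resource = condor remote-schedd.example.org remote-pool.example.org
    executable    = glue/build.sh
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    transfer_input_files    = my-sources.tar.gz, glue
    output = build.out
    error  = build.err
    log    = build.log
    queue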

10k Foot View
Past:
– humble beginnings: a ragtag crew of developers making building & testing easier for the projects around them (Condor, Globus, VDT, TeraGrid...)
Present:
– now we have tax money, and users should have higher expectations
– good news: six months into a new 3-year funding cycle, our "professionalism" has improved from our humble beginnings – better hardware, better processes, better staffing
– bad news: we’re still a bit ragtag – inconsistent support/development request tracking, inconsistent info on resource/lab improvements, issues, and resolutions, generally reactive to problems
– we're clearly contributing to the build & test capabilities of the community, but we’d like to deliver much more, especially with respect to testing.

10k Foot View: Future
Maintain Metronome and the NMI Lab
– continue to professionalize lab infrastructure; improve availability, stability, uptime
– better monitoring → more proactive response to issues
– better scheduling of jobs, better use of VMs to respond to uneven x86 platform demand
Enhance Metronome and the NMI Lab
– new features, new capabilities – but these might be less important than clarity, usability, and fit & finish of existing features.

10k Foot View: Future
Support Metronome and the NMI Lab
– more systematic support operation (ticketing, etc.)
– more utilization of basic testing capabilities by new users
– more utilization of advanced testing capabilities by existing users
– more & better information for users, admins, and pointy-haired bosses: better reporting on users, resources, usage, operations, etc.
Nurture the Distributed Software Testing Community
– to identify common build & test needs to improve software quality
– to challenge and help us to provide software & services that meet those build & test needs
– Tuesday’s meeting was a good start, I hope…

Maslow’s Pyramid of Testing Needs

Testing Opportunities
more resources == more possibilities (just like science)
– don’t just test under normal conditions; test the not-so-edge cases too (e.g., with CPU load!)
– test everywhere your users run, not just where you develop – old/exotic/unique resources you don’t own (NMI Lab, TeraGrid)
“black box”
– run your existing tinderbox, etc. test harness inside Metronome
decoupled builds & tests (sketched below)
– run new tests on old builds
– cross-platform binary compatibility testing
– run quick smoke tests continuously, heavy tests nightly, performance/scalability tests before release
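A rough sketch of how decoupled builds and tests can be expressed as separate Metronome-style runs: one run produces and archives the build, and a later run (possibly on a different platform) fetches that archived output and exercises it. The keywords and file names below are assumptions for illustration, not verbatim Metronome syntax:

    # Illustrative build run: its output is archived in the central results repository
    run_type    = build
    platforms   = x86_64_rhap_5
    remote_task = glue/build.sh          # packages binaries as its declared output

    # Illustrative later test run: instead of rebuilding, it pulls the archived
    # binaries from the earlier build (hypothetical fetch spec) and tests them,
    # here on a different platform for binary-compatibility testing
    run_type    = test
    platforms   = x86_64_sles_9
    inputs      = prior-build-results.scp
    remote_task = glue/run-tests.sh

The same split also makes mixing cadences cheap: quick smoke tests and heavy nightly tests can reuse one set of build artifacts rather than rebuilding each time.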

Testing Opportunities
managed (static) vs. “unmanaged” (auto-updating) platforms
– isolate your changes from the OS vendors’
– test your changes against a fixed target
– test your working code against a moving target
root-level testing
automated reports from testing tools (sketched below)
– Valgrind, Purify, Coverity, etc.
cross-platform binary testing (build on A, test on B)
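For the automated-reports bullet, one low-effort pattern is to wrap the existing test entry point in the checking tool and let Metronome collect the report with the rest of the run’s output. The run below is an illustrative sketch with a hypothetical wrapper script; only the Valgrind options are real:

    # Illustrative test run whose remote task wraps the existing suite in Valgrind;
    # the (hypothetical) wrapper might simply run:
    #   valgrind --leak-check=full --log-file=valgrind-report.txt ./run_test_suite
    # and leave the report alongside the other files Metronome collects as results.
    run_type    = test
    platforms   = x86_64_rhap_5
    remote_task = glue/run-tests-under-valgrind.sh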

Testing Opportunities
Parameterized dependencies (sketched below)
– build with multiple library versions, compilers, etc.
– test against every Java VM, Maven, and Ant version around
– test against different DBs (MySQL, Postgres, Oracle, etc.), VM platforms (Xen, VMware, etc.), and batch systems
– make sure new versions of Condor, Globus, etc. don’t break your code
Parallel scheduled testbeds
– cross-platform testing (A to B)
– deploy a software stack across many hosts, test the whole stack
– multi-site testing (US to Europe)
– network testing (cross-firewall, low-bandwidth, etc.)
– scalability testing
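For parameterized dependencies, the idea is that the lab pre-installs several versions of a compiler, VM, or library on each platform, and the spec names which ones a run should build and test against. The sketch below assumes a “prereqs” keyword for declaring those pre-installed packages; the keyword and package names are illustrative, not guaranteed Metronome syntax:

    # Illustrative parameterized-dependency run: the same sources are built against
    # several declared prerequisite toolchains on each requested platform.
    run_type    = build
    platforms   = x86_64_rhap_5, x86_rhas_3
    prereqs     = gcc-3.4.6, gcc-4.1.2, java-1.5.0, java-1.6.0
    remote_task = glue/build.sh     # hypothetical glue script; builds once per requested toolchain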

Upshot
This is all work we’d like to help this community do.
Start small – automated builds are an excellent start.
Think big – what kinds of testing would pay dividends?
Let us know what we can do to help make it happen.