Condor in LIGO, Festivus-Style
Peter Couvares, Syracuse University, LIGO Scientific Collaboration
Condor Week 2013, 1 May 2013

Who am I? What is LIGO?
Former Condor Team member ('99-'08). Now at Syracuse University, focused on distributed computing problems for the LIGO Scientific Collaboration.
LIGO (the Laser Interferometer Gravitational-Wave Observatory) is a large scientific experiment to detect cosmic gravitational waves and harness them for scientific research.

Feats of Strength
LIGO involves amazing science, multiple sites, tens of thousands of cores, hundreds of GPUs, hundreds of active users, gigantic DAGs with millions of jobs, and PBs of data.
LIGO loves Condor. See past Condor Week talks by Duncan Brown, Scott Koranda, myself, and others.
But I'm here today to be a PITA…

Airing of Grievances
Condor works so well in LIGO that people notice, and mind, when it doesn't.
Luckily I'm not the "Condor IT guy" for all of LIGO, but many issues nonetheless end up in my court; this is my attempt to generalize them and pose some questions.
Growing pain points for LIGO:
– Data-intensive workloads.
– Diagnosing failures.
– Priority/policy configuration.
– Job factories.
– GPUs.
– Checkpointing.
– Dynamic slots.

Data-Intensive Workloads
An old, hard problem. We have multiple big, fast NFS servers serving each cluster, and that worked well for a long time.
Core counts are increasing faster than NFS throughput, and users are increasingly killing fileservers, causing confusion and lost CPU and human time.
We may be reaching the limits of centralized NFS, but any I/O system has its limits. Can Condor better help us manage the limits we have, and/or deal with the I/O failures we experience?
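One knob that may already get us part of the way is HTCondor's concurrency limits, which cap how many running jobs may hold a named token at once. A minimal sketch, with an illustrative limit name and size; note it throttles matchmaking, not actual I/O:

    ## Pool (negotiator) configuration: at most 200 running jobs may hold
    ## an "nfs_home" token at once.  Name and number are illustrative.
    NFS_HOME_LIMIT = 200

    ## Submit description file: each job claims one token by default; an
    ## unusually I/O-heavy job could claim several, e.g. "nfs_home:10".
    universe           = vanilla
    executable         = analysis.sh
    concurrency_limits = nfs_home
    queue 500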

Diagnosing Failures
With larger clusters, more random things go wrong (fileserver outages, user errors, edge-case Condor bugs, etc.), and it seems like most of them require a Condor expert to debug.
The debugging process is often the same: a user complains that their job won't run, or won't complete, for whatever reason:
– Find the job in the queue and run condor_q -analyze.
– Grep the SchedLog and NegotiatorLog for relevant info.
– Find the node it ran on, if it ran.
– Grep the ShadowLog for relevant info.
– Grep the StartLog and StarterLog.slotN for relevant info.
– Turn up the debugging level and ask the user to call back when it recurs.
– If the debugging level was already up, the info is gone because the logs rolled. :-/
Can't some of this be automated?
– condor_gather_info!
Can there be better info propagation from the daemons to the end user?
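To make the point that some of this is automatable, here is a minimal sketch of the first few steps, meant to run on the submit host. The log paths come from condor_config_val, and the grep is far cruder than what an expert actually does; condor_gather_info already collects more than this:

    #!/usr/bin/env python
    # triage.py <cluster.proc> -- rough sketch automating the recipe above.
    # Assumes it runs on the submit host, condor_q and condor_config_val are
    # in PATH, and the daemon logs are readable.  The execute-node logs
    # (StartLog, StarterLog.slotN) still require a login to the node.
    import subprocess, sys

    job_id = sys.argv[1]  # e.g. "1234.0"

    # Why isn't it matching or running?
    print(subprocess.check_output(["condor_q", "-analyze", job_id]).decode())

    # Pull the lines mentioning this job out of the submit-side daemon logs.
    for knob in ("SCHEDD_LOG", "SHADOW_LOG", "NEGOTIATOR_LOG"):
        log = subprocess.check_output(["condor_config_val", knob]).decode().strip()
        print("==== %s (%s) ====" % (knob, log))
        try:
            with open(log) as f:
                for line in f:
                    if job_id in line:
                        print(line.rstrip())
        except IOError as err:
            print("could not read %s: %s" % (log, err))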

Priority/Policy Configuration
I'm not afraid of hairy policy config (see the Bologna Batch System), but it would be nice to have some help.
SU "Sugar" pool policy from 10,000 feet:
– Older compute nodes: local LIGO users > LIGO users > others > backfill
– Newer compute nodes (2064 cores):
   CPUs: 2/3 LIGO users + 1/3 LHCb users > others > backfill
   GPUs: local LIGO users > LIGO users > others > backfill
– Newest compute nodes (144 cores): CMTG > local LIGO users > others > backfill
This takes many screens of clever config-file magic to implement, and may need to be rewritten for dynamic slots.
– Not to mention relative user priority within these groups.
Can Condor give us higher-level tools to express this stuff?
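Concretely, implementing just the "2/3 LIGO + 1/3 LHCb on the newer nodes" line involves accounting-group quotas on the negotiator plus RANK expressions on those startds. A heavily simplified sketch with illustrative group names and users (the real thing is where the many screens come from):

    ## Negotiator: accounting groups with dynamic (fractional) quotas.
    GROUP_NAMES                    = group_ligo, group_lhcb
    GROUP_QUOTA_DYNAMIC_group_ligo = 0.66
    GROUP_QUOTA_DYNAMIC_group_lhcb = 0.34
    GROUP_ACCEPT_SURPLUS           = True    # let others/backfill use idle cores

    ## Startd on those nodes: prefer (hypothetical) local LIGO submitters.
    RANK = (Owner == "albert") + (Owner == "duncan")

    ## And every user has to remember, in their submit file:
    ##   accounting_group      = group_ligo
    ##   accounting_group_user = albert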

Job Factories
I keep finding myself in need of a simple job factory: "Watch this LHCb science-job queue and submit one LHCb VM job for each science job, up to a limit of 688 VMs. Remove VM jobs as the science-job queue shrinks."
GlideinWMS does all this and more, but is it lightweight enough for simple deployments like this? If so, I need the 5-minute quickstart guide!
Should simple factories (for glide-ins, VMs, etc.) be a first-class Condor feature?
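A sketch of the kind of thing meant, runnable from cron every few minutes. The custom job attributes, constraints, and submit-file name are placeholders, and a real factory also has to clean up as the science queue shrinks:

    #!/usr/bin/env python
    # lhcb_vm_factory.py -- naive one-VM-per-science-job factory, run from cron.
    # The job attributes, constraints, and submit-file name are illustrative.
    import subprocess

    MAX_VMS   = 688
    SCIENCE_Q = 'LHCbScienceJob =?= True && JobStatus == 1'   # idle science jobs
    VM_Q      = 'LHCbVMJob =?= True && JobStatus < 3'         # idle or running VMs

    def count(constraint):
        out = subprocess.check_output(
            ["condor_q", "-constraint", constraint, "-format", "%d\n", "ClusterId"])
        return len(out.splitlines())

    needed = min(count(SCIENCE_Q), MAX_VMS) - count(VM_Q)

    # lhcb_vm.sub is assumed to end with "queue 1"; submit once per missing VM.
    for _ in range(max(needed, 0)):
        subprocess.check_call(["condor_submit", "lhcb_vm.sub"])

    # Removing surplus VM jobs as the science queue shrinks is the other half
    # of the problem, left out of this sketch.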

GPUs
They're great! They suck! They probably approximate the future: fast, massively multi-core, unreliable.
We do a lot of work to manage them in Condor:
– Querying and advertising h/w attributes.
– Identifying and working around failures.
Can Condor handle more of this management for us?
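Roughly what the first bullet involves today, done by hand: a startd-cron probe publishes custom machine attributes, and jobs match against them. The probe script and attribute names here are hypothetical; nothing in this sketch is built into Condor:

    ## Run a site-written GPU probe periodically from the startd; it prints
    ## ClassAd attributes (e.g. GPU_COUNT = 2, GPU_HEALTHY = True) on stdout.
    STARTD_CRON_JOBLIST            = $(STARTD_CRON_JOBLIST) GPUINFO
    STARTD_CRON_GPUINFO_EXECUTABLE = /usr/local/libexec/gpu_probe
    STARTD_CRON_GPUINFO_PERIOD     = 10m

    ## Jobs that need a working GPU then say, in their submit file:
    ##   requirements = (GPU_COUNT >= 1) && (GPU_HEALTHY =?= True)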

Checkpointing
DMTCP is the future, but…
The standard universe still works seamlessly for the restricted cases where it works.
DMTCP (checkpoints aside) doesn't yet work so seamlessly for real use:
– No periodic checkpointing, checkpoint/resume, or checkpoint servers.
We keep asking our users to test DMTCP, but it's very hard to get users excited about porting something that works well to something that doesn't.
Bottom line: DMTCP needs to be closer to a drop-in replacement for the standard universe before users will be willing to give it a shot.
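For reference, this is the bar DMTCP has to clear: the entire standard-universe "porting" effort is a relink plus one submit-file line, which is why anything more feels like a step backwards to users. A generic sketch, not a LIGO-specific example:

    # Relink the application against Condor's checkpointing library...
    $ condor_compile gcc -o my_search my_search.c

    # ...and the submit file needs nothing checkpoint-specific at all:
    universe   = standard
    executable = my_search
    queue 1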

Dynamic Slots
We want them, we need them. Still a PITA for us:
– Issues with fetchwork.
– Issues with GPUs.
– Porting static-slot policy expressions.
– Retraining users and rewriting submit-file generation code.
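For context, the basic dynamic-slot (partitionable-slot) setup itself is tiny; the pain is in everything built on top of the old static slots. A sketch with illustrative numbers:

    ## Startd: one partitionable slot owning the whole machine; dynamic slots
    ## are carved off to fit each job's request_cpus / request_memory.
    NUM_SLOTS                 = 1
    NUM_SLOTS_TYPE_1          = 1
    SLOT_TYPE_1               = 100%
    SLOT_TYPE_1_PARTITIONABLE = True

    ## Submit side -- the part every user and submit-file generator must learn:
    ##   request_cpus   = 1
    ##   request_memory = 2048   # MB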

Followup
Todd knows where I live. Better yet, grab me in the hallway.
I'm happy to talk to anyone here about ideas you might have for these issues, or about how LIGO uses Condor to get a tremendous amount of work done despite them.