Analyzing LHC Data on 10K Cores with Lobster and Work Queue
Douglas Thain (on behalf of the Lobster Team)
The Cooperative Computing Lab
We collaborate with people who have large-scale computing problems in science, engineering, and other fields. We operate computer systems at the O(10,000)-core scale: clusters, clouds, and grids. We conduct computer science research in the context of real people and problems. We release open source software for large-scale distributed computing.
Large Hadron Collider → Compact Muon Solenoid detector (online trigger, 100 GB/s) → Worldwide LHC Computing Grid (many PB per year).
CMS Group at Notre Dame. Sample problem: search for events like ttH → ττ → (many decay products). The τ decays too quickly to be observed directly, so we observe the many decay products and work backwards: was a Higgs boson generated? (One run requires successive reduction of many TB of data using hundreds of CPU-years.) Anna Woodard, Matthias Wolf, Prof. Hildreth, Prof. Lannon.
Why not use the WLCG? The ND-CMS group has a modest Tier-3 facility of O(300) cores, but wants to harness the ND campus facility of O(10K) cores for its own analysis needs. But the CMS infrastructure is highly centralized:
– One global submission point.
– Assumes a standard operating environment.
– Assumes unit of submission = unit of execution.
We need a different infrastructure to harness opportunistic resources for local purposes.
Condor Pool at Notre Dame
Users of Opportunistic Cycles
Superclusters by the Hour
An Opportunity and a Challenge Lots of unused computing power available! And, you don’t have to wait in a global queue. But, machines are not dedicated to you, so they come and go quickly. Machines are not configured for you, so you cannot expect your software to be installed. Output data must be evacuated quickly, otherwise it can be lost on eviction.
Lobster: a personal data analysis system for custom codes running on non-dedicated machines at large scale.
Lobster Architecture (diagram): the user submits Analyze(Dataset, Code) to the Lobster Master, which submits workers through a traditional batch system. Each worker runs tasks that fetch software from CVMFS (the software archive) and data via XRootD (the data distribution network), then write output chunks to output storage, where they are merged into the final output files.
Nothing Left Behind! (same architecture diagram, with the workers removed): inputs come from the software archive and data distribution network, and all output chunks are evacuated to output storage, so nothing of value remains on the borrowed machines when they disappear.
Task Management with Work Queue
Work Queue Library

    #include "work_queue.h"

    while( not done ) {
        while( more work ready ) {
            task = work_queue_task_create();
            // add some details to the task
            work_queue_submit(queue, task);
        }
        task = work_queue_wait(queue);
        // process the completed task
    }
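A minimal but complete version of that loop, using the CCTools Work Queue C API as a sketch (the hostname command is just a placeholder task; a later sketch below shows attaching input files):

    #include "work_queue.h"
    #include <stdio.h>

    int main(void)
    {
        // Listen on the default Work Queue port (9123) for workers to connect.
        struct work_queue *queue = work_queue_create(WORK_QUEUE_DEFAULT_PORT);
        if(!queue) {
            fprintf(stderr, "could not create work queue\n");
            return 1;
        }

        // Submit ten placeholder tasks; a real application would also attach
        // input and output files to each task.
        for(int i = 0; i < 10; i++) {
            struct work_queue_task *task = work_queue_task_create("hostname");
            work_queue_submit(queue, task);
        }

        // Wait for completed tasks, with a 5 second timeout per call.
        while(!work_queue_empty(queue)) {
            struct work_queue_task *task = work_queue_wait(queue, 5);
            if(task) {
                printf("task %d exited %d: %s\n", task->taskid,
                       task->return_status, task->output ? task->output : "");
                work_queue_task_delete(task);
            }
        }

        work_queue_delete(queue);
        return 0;
    }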
Work Queue Applications: Nanoreactor MD Simulations, Adaptive Weighted Ensemble, Scalable Assembler at Notre Dame, ForceBalance.
Work Queue Architecture (diagram): the application (here the Lobster Master) calls the Work Queue master library with submit and wait, handing it local files and programs A, B, and C, and submits Task1(A,B) and Task2(A,C). The master sends the files and tasks to a worker process on a 4-core machine; the worker keeps the files in a cache directory and runs each 2-core task in its own sandbox with just the files it needs.
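The file movement in this diagram corresponds to the task file declarations in the Work Queue C API. A hedged sketch (the ./task program name is hypothetical, standing in for the program T in the diagram; files A, B, C are from the diagram):

    #include "work_queue.h"

    /* Build the two tasks from the diagram, given a queue created with
     * work_queue_create(). Files marked WORK_QUEUE_CACHE stay in the worker's
     * cache directory, so A and the program are transferred only once even
     * though both tasks use them. */
    void submit_diagram_tasks(struct work_queue *queue)
    {
        struct work_queue_task *t1 = work_queue_task_create("./task A B");
        work_queue_task_specify_file(t1, "task", "task", WORK_QUEUE_INPUT, WORK_QUEUE_CACHE);
        work_queue_task_specify_file(t1, "A", "A", WORK_QUEUE_INPUT, WORK_QUEUE_CACHE);
        work_queue_task_specify_file(t1, "B", "B", WORK_QUEUE_INPUT, WORK_QUEUE_CACHE);
        work_queue_task_specify_cores(t1, 2);   /* the diagram shows 2-core tasks */
        work_queue_submit(queue, t1);

        struct work_queue_task *t2 = work_queue_task_create("./task A C");
        work_queue_task_specify_file(t2, "task", "task", WORK_QUEUE_INPUT, WORK_QUEUE_CACHE);
        work_queue_task_specify_file(t2, "A", "A", WORK_QUEUE_INPUT, WORK_QUEUE_CACHE);
        work_queue_task_specify_file(t2, "C", "C", WORK_QUEUE_INPUT, WORK_QUEUE_CACHE);
        work_queue_task_specify_cores(t2, 2);
        work_queue_submit(queue, t2);
    }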
Run Workers Everywhere (diagram): thousands of workers form a personal cloud for the Lobster / Work Queue master. Workers are started on a private cluster via ssh, on a campus Condor pool via condor_submit_workers, on a shared SGE cluster via sge_submit_workers, and on a public cloud provider; the master submits tasks and its local files and programs (A, B, C) to whichever workers connect.
Scaling Up to 20K Cores (diagram): the Lobster master application submits and waits on tasks through the Work Queue master library, which hands them to foremen; each foreman manages many 16-core workers, so the hierarchy scales to tens of thousands of cores. Michael Albrecht, Dinesh Rajan, Douglas Thain, Making Work Queue Cluster-Friendly for Data Intensive Scientific Applications, IEEE International Conference on Cluster Computing, September 2013.
Choosing the Task Size (diagram: the same stream of events split either into several 100-event tasks or into fewer 200-event tasks, each task with its own setup and output phases). Small tasks: high overhead, low cost of failure, high cost of merging. Large tasks: low overhead, high cost of failure, low cost of merging.
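To make the tradeoff concrete, here is a rough back-of-the-envelope cost model (my own simplification, not the trace-driven simulation used on the next slide). Assume each task pays a fixed setup cost t_s, processes n events at t_e each, contributes a per-task merge cost t_m, and is evicted and re-run with probability p(n) that grows with task length. Then the expected cost per event is roughly, in LaTeX notation:

    C(n) \approx \frac{t_s + t_m}{n} + t_e \,\bigl(1 + p(n)\bigr)

The first term (per-event share of setup and merging) shrinks as tasks grow, while the second (wasted work on eviction) grows, so efficiency peaks at an intermediate task size.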
Ideal Task Size (plot from a trace-driven simulation, marking the task size that gives maximum efficiency).
Software Delivery with Parrot and CVMFS
CMS Application Software: a carefully curated and versioned collection of analysis software, data-access libraries, and visualization tools; several hundred GB of executables, compilers, scripts, libraries, and configuration files. The user expects to do no more than:

    export CMSSW=/path/to/cmssw
    . $CMSSW/cmsset_default.sh

How can we deliver the software everywhere?
Parrot Virtual File System (diagram): a Unix application runs on top of Parrot, which captures its system calls via ptrace and routes them to drivers for local files, iRODS, Chirp, HTTP, and CVMFS. A custom namespace remaps paths, e.g. /home = /chirp/server/myhome and /software = /cvmfs/cms.cern.ch/cmssoft; Parrot also supports file access tracing, sandboxing, user ID mapping, and more. Parrot runs as an ordinary user, so no special privileges are required to install or use it, which makes it useful for harnessing opportunistic machines via a batch system.
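The interposition technique can be illustrated with a minimal ptrace tracer (Linux x86_64 only). This is a sketch of the general mechanism, not Parrot's implementation, which additionally rewrites paths and services the I/O itself through its drivers:

    /* Minimal sketch of ptrace-based system call interception on Linux x86_64.
     * Usage: ./trace <command> [args...]   (hypothetical tool, not Parrot) */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ptrace.h>
    #include <sys/types.h>
    #include <sys/user.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
        if(argc < 2) {
            fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
            return 1;
        }

        pid_t child = fork();
        if(child == 0) {
            /* Child: ask to be traced, then run the target program. */
            ptrace(PTRACE_TRACEME, 0, NULL, NULL);
            execvp(argv[1], &argv[1]);
            perror("execvp");
            exit(1);
        }

        /* Parent: stop at every system call (this sketch prints at both
         * entry and exit of each call). */
        int status;
        waitpid(child, &status, 0);
        while(!WIFEXITED(status)) {
            struct user_regs_struct regs;
            ptrace(PTRACE_GETREGS, child, NULL, &regs);
            /* orig_rax holds the system call number on x86_64; a tool like
             * Parrot would inspect and possibly redirect the call here. */
            fprintf(stderr, "syscall %lld\n", (long long) regs.orig_rax);
            ptrace(PTRACE_SYSCALL, child, NULL, NULL);
            waitpid(child, &status, 0);
        }
        return 0;
    }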
Parrot + CVMFS (diagram): the CMS software tree (967 GB, 31M files) is built into a content-addressable store (CAS) and published on web servers. Each CMS task runs under Parrot, whose CVMFS driver fetches metadata and data with HTTP GET through a hierarchy of squid proxies into a local CAS cache.
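Content-addressable storage means an object's name is derived from a hash of its contents, so identical files deduplicate automatically and any cache can verify what it fetched. A toy sketch of the naming scheme (the two-level directory layout and SHA-1 naming follow the general CVMFS convention, but this is illustrative code, not CVMFS itself):

    #include <stdio.h>

    /* Given a hex content hash, build the object's path in a two-level
     * content-addressable store, e.g. "abcd..." -> "data/ab/cd...".
     * Illustrative only; real CVMFS adds object-type suffixes and compression. */
    static void cas_object_path(const char *hex_digest, char *path, size_t size)
    {
        snprintf(path, size, "data/%.2s/%s", hex_digest, hex_digest + 2);
    }

    int main(void)
    {
        char path[256];
        /* A made-up digest standing in for some file in the software tree. */
        cas_object_path("3f786850e387550fdab836ed7e6dc881de23001b", path, sizeof(path));
        printf("%s\n", path);   /* prints data/3f/786850e3... */
        return 0;
    }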
Parrot + CVMFS: global distribution of a widely used software stack, with updates automatically deployed. Metadata is downloaded in bulk, so directory operations are fast and local. Only the subset of files actually used by an application is downloaded (typically of order MB). Data is shared at the machine, cluster, and site levels. Jakob Blomer, Predrag Buncic, Rene Meusel, Gerardo Ganis, Igor Sfiligoi and Douglas Thain, The Evolution of Global Scale Filesystems for Scientific Software Distribution, IEEE/AIP Computing in Science and Engineering, 17(6), pages 61-71, December 2015.
Lobster in Production
The Good News Typical daily production runs on 1K cores. Largest runs: 10K cores on data analysis jobs, and 20K cores on simulation jobs. One instance of Lobster at ND is larger than all CMS Tier-3s, and 10% of the CMS WLCG. Lobster isn’t allowed to run on football Saturdays – too much network traffic! Anna Woodard, Matthias Wolf, Charles Mueller, Nil Valls, Ben Tovar, Patrick Donnelly, Peter Ivie, Kenyi Hurtado Anampa, Paul Brenner, Douglas Thain, Kevin Lannon and Michael Hildreth, Scaling Data Intensive Physics Applications to 10k Cores on Non-Dedicated Clusters with Lobster, IEEE Conference on Cluster Computing, September, 2015.
Running on 10K Cores
Competitive with CSA14 Activity
The Hard Part: Debugging and Troubleshooting. The output archive would mysteriously stop accepting output for >1K clients; diagnosis: a hidden file descriptor limit. The entire pool would grind to a halt a few times per day; diagnosis: one failing HDFS node behind an XRootD node at the University of XXX. A wide-area network outage would cause massive fluctuations as workers start and quit. (Robustness can be dangerous!)
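For the first of these, the relevant knob is the per-process open file descriptor limit. A generic illustration in C of inspecting and raising it with the standard getrlimit/setrlimit calls (not the actual fix applied to the output archive):

    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit rl;

        /* Report the current soft and hard limits on open file descriptors. */
        if(getrlimit(RLIMIT_NOFILE, &rl) == 0) {
            printf("soft limit: %llu, hard limit: %llu\n",
                   (unsigned long long) rl.rlim_cur,
                   (unsigned long long) rl.rlim_max);
        }

        /* Raise the soft limit to the hard limit so >1K clients can connect. */
        rl.rlim_cur = rl.rlim_max;
        if(setrlimit(RLIMIT_NOFILE, &rl) != 0) {
            perror("setrlimit");
            return 1;
        }
        return 0;
    }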
Monitoring Strategy (diagram): the Lobster master records performance as observed by each task in a monitoring database, alongside its view of the batch system, software archive, and data distribution network. Each task's time is broken into phases, e.g. wq idle 15 s, wq input 2.3 s, setup 3.5 s, stage-in 10.1 s, scram 5.9 s, run 3624 s, wait 65 s, stage-out 92 s, wq output wait 7 s, wq output 2 s.
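A hypothetical sketch of the kind of per-task record behind such a breakdown, and the useful-time fraction one might aggregate from it (field names mirror the phases above; this is not Lobster's actual monitoring schema):

    #include <stdio.h>

    /* Per-task timing record, in seconds; fields mirror the phases above.
     * Hypothetical sketch, not Lobster's actual schema. */
    struct task_timing {
        double wq_idle, wq_input, setup, stagein, scram, run,
               wait, stageout, wq_output_wait, wq_output;
    };

    /* Fraction of a task's wall time spent in the useful run phase. */
    double useful_fraction(const struct task_timing *t)
    {
        double total = t->wq_idle + t->wq_input + t->setup + t->stagein +
                       t->scram + t->run + t->wait + t->stageout +
                       t->wq_output_wait + t->wq_output;
        return total > 0 ? t->run / total : 0.0;
    }

    int main(void)
    {
        /* Numbers taken from the example breakdown above. */
        struct task_timing t = { 15, 2.3, 3.5, 10.1, 5.9, 3624, 65, 92, 7, 2 };
        printf("useful fraction: %.2f\n", useful_fraction(&t));
        return 0;
    }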
Problem: Task Oscillations
Diagnosis: Bottleneck in Stage-Out
Good Run on 10K Cores
Lessons Learned:
– Distinguish between the unit of work and the unit of consumption/allocation.
– Monitor resources from the application's perspective, not just the system's perspective.
– Put an upper bound on every resource and every concurrent operation.
– Where possible, decouple the consumption of different resources (e.g. staging vs. compute); see the sketch below.
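As one concrete way to bound a concurrent operation while keeping staging decoupled from compute, a counting semaphore around stage-out might look like this (an illustrative sketch; MAX_CONCURRENT_STAGEOUT and do_stageout are hypothetical, not Lobster code):

    /* Bound the number of concurrent stage-out operations with a counting
     * semaphore. Compile with -pthread. Illustrative sketch only. */
    #include <semaphore.h>

    #define MAX_CONCURRENT_STAGEOUT 50      /* illustrative upper bound */

    static sem_t stageout_slots;

    /* Hypothetical transfer routine; a stub stands in for the real work. */
    static void do_stageout(const char *chunk) { (void) chunk; }

    void stageout_limit_init(void)
    {
        sem_init(&stageout_slots, 0, MAX_CONCURRENT_STAGEOUT);
    }

    /* No more than MAX_CONCURRENT_STAGEOUT of these run at once, so output
     * transfer stays bounded and decoupled from compute. */
    void bounded_stageout(const char *chunk)
    {
        sem_wait(&stageout_slots);
        do_stageout(chunk);
        sem_post(&stageout_slots);
    }

    int main(void)
    {
        stageout_limit_init();
        bounded_stageout("out.chunk.0001");   /* example chunk name */
        return 0;
    }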
Acknowledgements. Center for Research Computing: Paul Brenner, Serguei Fedorov. CCL Team: Ben Tovar, Peter Ivie, Patrick Donnelly. Notre Dame CMS Team: Anna Woodard, Matthias Wolf, Charles Mueller, Nil Valls, Kenyi Hurtado, Kevin Lannon, Michael Hildreth. HEP Community: Jakob Blomer (CVMFS), David Dykstra (Frontier). NSF Grant ACI: "Connecting Cyberinfrastructure with the Cooperative Computing Tools".
The Lobster Data Analysis System The Cooperative Computing Lab Prof. Douglas Thain