The Panda System Mark Sosebee (for K. De) University of Texas at Arlington DOSAR Workshop, March 30, 2006

Outline
- What is Panda?
- Why Panda?
- Panda Goals and Expectations
- Panda Performance
- Panda Design
- Summary of Panda Features and Components
- Panda Contributors

What is Panda?
- PanDA – Production and Distributed Analysis system
- Project started Aug 17, 2005 – baby Panda growing!
- New system developed by the U.S. ATLAS team
- Rapid development from scratch
- Leverages DC2/Rome experience
- Inspired by DIRAC and other systems
- Already in use for CSC production in the U.S.
- Better scalability/usability compared to the DC2 system
- Will be available for distributed analysis users in a few months
- "One-stop shopping" for all ATLAS computing users in the U.S.

Why Panda?
- ATLAS used a supervisor/executor system for Data Challenge 2 (DC2) and Rome production
- Windmill supervisor common for all grids – developed by K. De
- U.S. executor (Capone) developed by the UC/ANL team
- Four other executors were available ATLAS-wide
- Large scale production was very successful on the grid
- Dozens of different workflows (evgen, G4, digi, pile-up, reco…)
- Hundreds of large MC samples produced for physics analysis
- DC2/Rome experience led to the development of Panda
- Operation of the DC2/Rome system was too labor-intensive
- The system could not utilize all available hardware resources
- Scaling problems – hard to scale up by the required factor
- No distributed analysis system available, no data management

Original Panda Goals
- Minimize dependency on unproven external components
- Panda relies only on proven and scalable external components – Apache, Python, MySQL
- Backup/failover systems built into Panda for all other components
- Local pilot submission backs up the CondorG scheduler
- GridFTP backs up FTS
- Integrated with ATLAS prodsys
- Maximize automation – reduce the operations crew
- Panda is operating with a smaller (half-size) shift team and has already exceeded DC2/Rome production scale
- Expect to scale the production rate by a factor of 10 without significantly increasing operations support
- Additionally, provide support for distributed analysis users (hundreds of users in the U.S.) without a large increase in personnel

Extensive Error Analysis and Monitoring in Panda Simplifies Operations

Original Panda Goals (cont.)
- Efficient utilization of available hardware
- So far, saturated all available T1 and T2 resources during sustained CSC production over February
- Available CPUs are expected to increase by a factor of 2-4 soon (~one month): Panda is expected to continue at full saturation
- Demonstrate the scalability required for ATLAS in 2008
- Already seen a factor of 4 higher peak performance compared to DC2/Rome production
- No scaling limits expected for a substantial further scale-up
- Many techniques are available if scaling limits are reached – proven Apache technology, deployment of multiple Panda servers, or proven MySQL cluster technology

Panda in CSC Production
[Production plot; slide annotations: DC2/Rome Production on Grid3, >400 CPUs]

Original Panda Goals (cont. 2)
- Tight integration with the DQ2 data management system
- Panda is the only prodsys executor fully integrated with DQ2
- Panda has played a valuable role in testing and hardening DQ2 in a real distributed production environment
- Excellent synergy between the Panda and DQ2 teams
- Provide a distributed analysis service for U.S. users
- Integrated/uniform service for all U.S. grid sites
- Variety of tools supported
- Scalable system to support hundreds of users

Panda Design

Key Panda Features
- Service model – Panda runs as an integrated service for all ATLAS sites (currently U.S.), handling all grid jobs (production and analysis)
- Task Queue – provides a batch-like queue for distributed grid resources (unified monitoring interface for production managers and all grid users)
- Strong data management (lesson from DC2) – pre-stage, track and manage every file on the grid asynchronously, consistent with the DQ2 design
- Block data movement – pre-staging of output files is done by an optimized DQ2 service based on datasets, reducing latency for distributed analysis (jobs follow the data)
- Pilot jobs – prescheduled to batch systems and grid sites; the actual ATLAS job (payload) is scheduled when a CPU becomes available, leading to low latency for analysis tasks (see the sketch after this list)
- Support all job sources – managed or regional production (ATLAS ProdSys), user production (tasks, DIAL, Root, pAthena, scripts or transformations, GANGA…)
- Support any site – minimal site requirements: pilot jobs (submitted locally or through the grid), outbound http, and integration with DQ2 services
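
To make the pilot-job idea above concrete, here is a minimal pilot-loop sketch in Python. It is illustrative only: the server URL, the /getJob and /updateJob endpoint names, the JSON response fields and the site name are assumptions, not the actual Panda pilot interface described on this slide.

    # Minimal pilot-style sketch (illustrative only): the server URL, the
    # endpoint names and the response fields are assumptions, not the real
    # Panda pilot interface.
    import json
    import subprocess
    import urllib.parse
    import urllib.request

    PANDA_SERVER = "https://pandaserver.example.org:25443"  # hypothetical URL

    def request_payload(site_name):
        """Ask the server for an actual job (payload) once a CPU is free."""
        data = urllib.parse.urlencode({"siteName": site_name}).encode()
        with urllib.request.urlopen(PANDA_SERVER + "/getJob", data=data) as resp:
            body = resp.read().decode()
        return json.loads(body) if body else None

    def report_status(job_id, state):
        """Tell the server how the payload ended (finished or failed)."""
        data = urllib.parse.urlencode({"jobID": job_id, "state": state}).encode()
        urllib.request.urlopen(PANDA_SERVER + "/updateJob", data=data).close()

    def run_pilot(site_name):
        job = request_payload(site_name)        # payload is pulled, not pushed
        if not job:                             # nothing activated for this site
            return
        result = subprocess.run(job["transformation"], shell=True)
        report_status(job["jobID"], "finished" if result.returncode == 0 else "failed")

    if __name__ == "__main__":
        run_pilot("EXAMPLE_SITE")               # hypothetical site name

The design point is visible in the flow: the payload is pulled by the pilot only after a CPU slot has already been secured, which is what keeps latency low for analysis tasks.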

Panda Components
- Job Interface – allows injection of jobs into the system
- Executor Interface – translation layer for ATLAS prodsys/prodDB
- Task Buffer – keeps track of all active jobs (job state is kept in MySQL)
- Brokerage – initiates subscriptions for a block of input files required by jobs (preferentially chooses sites where the data is already available; see the sketch after this list)
- Dispatcher – sends the actual job payload to a site, on demand, if all conditions (input data, space and other requirements) are met
- Data Service – interface to the DQ2 data management system
- Job Scheduler – sends pilot jobs to remote sites
- Pilot Jobs – lightweight execution environment to prepare the CE, request the actual payload, execute it, and clean up
- Logging and Monitoring systems – http and web-based
- All communications go through REST-style HTTPS services (via mod_python and Apache servers)
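
The brokerage bullet above ("preferentially chooses sites where the data is already available") can be sketched as a toy site-selection function in Python. The data structures, dataset name and site names are hypothetical; the real brokerage works against DQ2 and considers many more site attributes.

    # Toy brokerage sketch, assuming a simple map of site -> datasets already
    # resident there; every name here is hypothetical.
    def choose_site(input_dataset, site_datasets, default_site):
        """Prefer a site that already holds the input data ("jobs follow the data")."""
        for site, datasets in site_datasets.items():
            if input_dataset in datasets:
                return site, False          # no dispatch subscription needed
        # Otherwise fall back and subscribe the block of input files to that site.
        return default_site, True           # True -> initiate a DQ2 subscription

    site, needs_subscription = choose_site(
        "csc.005001.evgen",                 # hypothetical dataset name
        {"BNL": {"csc.005001.evgen"}, "UTA": set()},
        default_site="UTA",
    )
    print(site, needs_subscription)         # -> BNL False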

March 30, 2006 Mark Sosebee defined  assigned  activated  running  finished/failed (If input files are not available: defined  waiting. Once files are available, chain picks up again at “assigned” stage) defined : inserted in panda DB assigned : dispatchDBlock is subscribed to site waiting : input files are not ready activated: waiting pilot requests running : running on a worker node finished : finished successfully failed : failed Job States in the System

Job States in the System (cont.)
What triggers a change in job status? (See the sketch below.)
- defined → assigned / waiting: automatic
- assigned → activated: received a callback which the DQ2 site service sends when the dispatchDBlock is verified. If a job doesn't have input files, it is activated without a callback.
- activated → running: picked up by a pilot
- waiting → assigned: received a callback which the DQ2 site service sends when the destinationDBlock is verified
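
The transitions on the last two slides can be collected into a small state-machine sketch in Python. It encodes only what is stated above; in the real system these states live in the Task Buffer's MySQL database, and the callback mechanics are simplified here.

    # State-machine sketch of the job states and transitions listed above.
    ALLOWED = {
        "defined":   {"assigned", "waiting"},   # waiting if input files are not ready
        "waiting":   {"assigned"},              # DQ2 callback: destinationDBlock verified
        "assigned":  {"activated"},             # DQ2 callback: dispatchDBlock verified
                                                # (or no input files at all)
        "activated": {"running"},               # picked up by a pilot
        "running":   {"finished", "failed"},    # finished successfully or failed
    }

    def advance(current, new):
        """Allow only the transitions described on these slides."""
        if new not in ALLOWED.get(current, set()):
            raise ValueError(f"illegal transition: {current} -> {new}")
        return new

    # Normal chain: defined -> assigned -> activated -> running -> finished
    state = "defined"
    for step in ("assigned", "activated", "running", "finished"):
        state = advance(state, step)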

Panda Contributors
- Project Coordinators: Torre Wenaus – BNL, Kaushik De – UTA
- Lead Developer: Tadashi Maeno – BNL
- Panda team:
  Brookhaven National Laboratory (BNL): Wensheng Deng, Alexei Klimentov, Pavel Nevski, Yuri Smirnov, Tomasz Wlodek, Xin Zhao
  University of Texas at Arlington (UTA): Nurcan Ozturk, Mark Sosebee
  University of Oklahoma (OU): Karthik Arunachalam, Horst Severini
  University of Chicago (UC): Marco Mambelli
  Argonne National Laboratory (ANL): Jerry Gieraltowski
  Lawrence Berkeley Lab (LBL): Martin Woudstra
- Distributed Analysis team (from DIAL): David Adams – BNL, Hyunwoo Kim – UTA
- DQ2 developers (CERN): Miguel Branco, David Cameron

Panda Performance Relative to the Other Grids
How does this system compare to the resources needed to staff equivalent non-PanDA-based production elsewhere in ATLAS? Here is one possibility (a per-crew comparison is worked out below):
- Panda has a single shift crew; 30k CSC jobs completed
- NorduGrid (NG) has a single shift crew; 12k CSC jobs completed, with approximately the same number of available CPUs as Panda
- LCG (Lexor) has two shift crews now, a third one in training; 32k CSC jobs completed
- LCG-CG has two shift crews now, a third one in training; 51k CSC jobs completed
- LCG + LCG-CG have 4-5 times the number of available CPUs as Panda (shared between the two executors)
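
For a rough per-crew view, the figures quoted on this slide work out as follows (counting the LCG and LCG-CG crews as two each, since the third crews are still in training):

    # Jobs completed per shift crew, using only the figures quoted on this slide.
    completed = {"Panda": 30_000, "NorduGrid": 12_000, "LCG (Lexor)": 32_000, "LCG-CG": 51_000}
    crews     = {"Panda": 1,      "NorduGrid": 1,      "LCG (Lexor)": 2,      "LCG-CG": 2}
    for grid in completed:
        print(f"{grid}: {completed[grid] // crews[grid]:,} CSC jobs per shift crew")
    # Panda 30,000; NorduGrid 12,000; LCG (Lexor) 16,000; LCG-CG 25,500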