A Scalable and Resilient PanDA Service for Open Science Grid
Dantong Yu, Grid Group, RHIC and US ATLAS Computing Facility

Motivation and Outline
- Project goal: build a scalable and resilient PanDA service for OSG.
  - Support multiple OSG VOs and thousands of VO users and jobs.
  - Reliable, scalable, and high performance.
  - Cost-effective and flexible deployment.
- Optional solutions for other sites interested in deploying PanDA systems:
  - Software-based reliability solution (Linux VA).
  - F5 hardware load-balancing switch (BNL's choice), with 24x7 vendor support.
- Focus: evaluate and validate the F5-based solution for the PanDA system in terms of reliability and scalability.
  - A joint effort between the Physics Application Group and the RACF Grid Computing Group to exercise every component of the PanDA system.
- In this talk: the BNL OSG PanDA architecture, items to be evaluated, and performance results.

Reliable/High-Performance OSG/ATLAS Job Management Architecture (PanDA)
[Architecture diagram: clients connect to a virtual IP (VIP) on the F5 server load-balancing switch, which rewrites the source and destination addresses in the IP header (IP relay) and forwards requests through virtual services to the physical servers: PanDA servers, monitoring servers, AutoPilot, the PanDA database, and the PanDA archive.]
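To illustrate the dispatch behaviour behind the VIP, the following is a minimal Python sketch of ratio-based (4:1) server selection. It is an explanatory model only, not the F5 BIG-IP implementation (which also performs health probing and IP header rewriting); the host names are taken from the testbed slide in the backup section.

# Minimal illustration (not the F5 implementation): a virtual service that
# forwards requests to physical servers using a 4:1 dispatch ratio.
import itertools

# Physical PanDA servers behind the virtual IP (hosts from the testbed slide).
POOL = {
    "gridui08.usatlas.bnl.gov": 4,  # eight-core server, ratio weight 4
    "gridui04.usatlas.bnl.gov": 1,  # dual-CPU server, ratio weight 1
}

def ratio_schedule(pool):
    """Yield servers in proportion to their configured ratio."""
    expanded = [host for host, weight in pool.items() for _ in range(weight)]
    return itertools.cycle(expanded)

if __name__ == "__main__":
    scheduler = ratio_schedule(POOL)
    for request_id in range(10):
        print(f"request {request_id} -> {next(scheduler)}")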

Items to Be Evaluated
- Functionality:
  - PanDA servers (job submission and job dispatching to pilots).
  - PanDA monitoring (users are able to load URLs from the VIP, the canonical name, or the physical IP addresses).
- Reliability:
  - Loss of one service/server is transparent to users.
- Performance enhancement:
  - PanDA monitoring server performance test on the F5 switch, compared with a standalone server (virtual server vs. standalone servers).
  - PanDA server performance test on the F5 switch, compared with a standalone server (virtual server vs. standalone servers).
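The monitoring functionality check described above can be scripted; the sketch below is an assumption about how such a check might look, not the actual test suite. The host names come from the testbed slide, and the URL path is a placeholder.

# Sketch of a functionality check: confirm that the PanDA monitor answers on
# the virtual service name and on each physical server.
import urllib.request

HOSTS = [
    "pandamondev.usatlas.bnl.gov:25880",  # virtual service (F5 VIP)
    "gridui04.usatlas.bnl.gov:25880",     # physical server 1
    "gridui08.usatlas.bnl.gov:25880",     # physical server 2
]

for host in HOSTS:
    url = f"http://{host}/"               # placeholder path
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            print(f"{host}: HTTP {response.status}")
    except Exception as exc:
        print(f"{host}: FAILED ({exc})")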

Composite Results: Transaction Times
- Significant improvement in average response time with the virtual server over the physical servers (3 seconds, versus 13 seconds and 7 seconds).

Composite Results for PanDA Monitoring
- No significant difference between cached and non-cached access.
- Virtual-service throughput (in terms of processed requests per second) is much better than the aggregated throughput of the two physical servers (PanDA monitoring is more efficient under lower load).
[Chart: requests per second, no-cache vs. cached.]

Results of Job Submission to the PanDA Server
- Job submission persists when any server goes off-line.
- No apparent boost in scalability, likely due to a database bottleneck.

Results of Job Dispatch from the PanDA Servers
- Job dispatching is not affected by a single-server failure.
- No apparent boost in scalability either: high activity on the database server.

Conclusions and Future Work
- Functionality: F5-based virtual services passed all tests.
- Reliability: virtual services survived any single point of hardware failure.
- Performance enhancement: virtual PanDA monitoring services showed a significant performance improvement, and performance scales with the number of physical servers. The PanDA server itself shows no performance improvement; the database appears to be the bottleneck.
- F5 provides a flexible, seamless migration from stand-alone services to highly redundant, reliable services.
- Future work:
  - Continue to develop more robust, VO-independent test suites for the PanDA monitoring server and the PanDA server.
  - Add more VOs to the PanDA test bed, and allow other VOs to do job submission and dispatch.

Backup Slides

Outline
- Project responsibilities.
- The PanDA system and the proposed architecture.
- Items to be evaluated.
- Testbed and test environments.
- PanDA monitoring performance results.
- PanDA server performance results.
- Conclusions, identified problems, and future work.

Responsibilities
- PanDA system and F5 switch integration (Dantong and John).
- Testbed deployment and maintenance (Torre, Tadashi, John, and Aaron).
- PanDA monitoring functionality, reliability, and performance (John DeStefano, Dantong Yu).
- PanDA server functionality, reliability, and performance (John Hover and Dantong Yu).
- PanDA test suites (Maxim, Torre, Tadashi, John Hover).
- PanDA database configuration (Yuri and Tomasz).
- F5 switch configuration (Frank Burstein).

Testbed Hosts
- PanDA virtual services:
  - Monitoring server: pandamondev.usatlas.bnl.gov:25880 ( )
  - Job submission and dispatch server: pandasrvdev.usatlas.bnl.gov:25880 ( )
- PanDA physical servers:
  - gridui04.usatlas.bnl.gov:25880: dual CPU with hyper-threading
  - gridui08.usatlas.bnl.gov:25880: dual CPU with eight cores
- PanDA database servers
- F5 BIG-IP Load Balancer 6400 (dispatch ratio 4:1 between the two physical servers)

Test Environment
- PanDA monitoring server tests:
  - Tool: Apache Jakarta JMeter.
  - Sequential and concurrent requests: 10 concurrent requests, 1000 total requests per session.
  - URL test string: cached: DM; no cache: DM&reload=yes
  - "no-cache" headers also specified explicitly in the test requests.
- PanDA server test:
  - Customized PanDA test suites submit user jobs to the PanDA servers and dispatch jobs to pilots.
  - Concurrent requests: submit 50 jobs and request dispatch of 50 jobs.
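For readers without JMeter, the monitoring test can be approximated with a short script. The sketch below mirrors only the parameters described above (10 concurrent requests, 1000 total, reload=yes plus an explicit no-cache header for the non-cached case); the host is the virtual monitoring service and the query string is a placeholder, since the full test URL is not shown on the slide.

# Sketch approximating the JMeter test: 10 concurrent requests, 1000 total,
# optionally forcing a reload and sending a no-cache header, then reporting
# the average and longest request times.
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

HOST = "pandamondev.usatlas.bnl.gov:25880"   # virtual monitoring service
TOTAL_REQUESTS = 1000
CONCURRENCY = 10

def fetch(no_cache: bool) -> float:
    """Fetch the monitoring page once and return the elapsed time in ms."""
    url = f"http://{HOST}/?mode=DM"          # placeholder query string
    if no_cache:
        url += "&reload=yes"
    request = urllib.request.Request(url)
    if no_cache:
        request.add_header("Cache-Control", "no-cache")
    start = time.time()
    urllib.request.urlopen(request, timeout=60).read()
    return (time.time() - start) * 1000.0

def run(no_cache: bool):
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        times = list(pool.map(lambda _: fetch(no_cache), range(TOTAL_REQUESTS)))
    label = "no cache" if no_cache else "cached"
    print(f"{label}: average {sum(times)/len(times):.0f} ms, "
          f"longest {max(times):.0f} ms")

if __name__ == "__main__":
    run(no_cache=True)
    run(no_cache=False)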

Reference Information
- Apache Jakarta JMeter
- PanDA project description: …-reliability-projects/panda-project-description/view

Results: pandamondev (F5)
- No cache, concurrent:
  - Average request time: 3090 ms
  - Longest request time: 5231 ms
- Cached, concurrent:
  - Average request time: 2681 ms
  - Longest request time: ms

Results: gridui04
- No cache, concurrent:
  - Average request time: 6472 ms
  - Longest request time: ms
- Cached, concurrent:
  - Average request time: 6643 ms
  - Longest request time: ms

Results: gridui08
- No cache, concurrent:
  - Average request time: ms
  - Longest request time: ms
- Cached, concurrent:
  - Average request time: ms
  - Longest request time: ms
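To relate these per-host numbers back to the composite slide, the small calculation below uses the no-cache averages recorded above (3090 ms for the virtual service, 6472 ms for gridui04) and the roughly 13-second figure quoted on the composite slide, which presumably corresponds to gridui08 since its values are not preserved in this transcript.

# Speedup of the virtual service over the physical servers (no-cache averages).
virtual_ms = 3090        # pandamondev (F5) average
gridui04_ms = 6472       # gridui04 average
slow_server_ms = 13000   # approximate figure from the composite slide

print(f"vs gridui04: {gridui04_ms / virtual_ms:.1f}x faster")
print(f"vs slower server: {slow_server_ms / virtual_ms:.1f}x faster")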

Results of Job Submission to the PanDA Server

time ./testJobSubmit.py -s gridui08.usatlas.bnl.gov -n 50
  real 0m1.057s  user 0m0.214s  sys 0m0.199s
  real 0m1.036s  user 0m0.215s  sys 0m0.206s
  real 0m0.998s  user 0m0.207s  sys 0m0.198s
  real 0m0.956s  user 0m0.216s  sys 0m0.194s

time ./testJobSubmit.py -s gridui04.usatlas.bnl.gov -n 50
  real 0m1.134s  user 0m0.225s  sys 0m0.181s
  real 0m1.095s  user 0m0.225s  sys 0m0.182s
  real 0m1.294s  user 0m0.222s  sys 0m0.193s

time ./testJobSubmit.py -s pandasrvdev.usatlas.bnl.gov -n 50
  real 0m7.252s  user 0m0.224s  sys 0m0.192s
  real 0m1.010s  user 0m0.208s  sys 0m0.204s
  real 0m1.036s  user 0m0.225s  sys 0m0.187s
  real 0m1.066s  user 0m0.227s  sys 0m0.178s
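The internals of testJobSubmit.py are not shown in the slides; a hypothetical wrapper that reproduces the measurement pattern above (four timed runs of 50-job submissions against each server) might look like the sketch below. It simply wraps the existing script and records wall-clock time, analogous to the shell `time` output.

# Hypothetical timing wrapper around the (assumed) submission test script.
import subprocess
import time

SERVERS = [
    "gridui08.usatlas.bnl.gov",
    "gridui04.usatlas.bnl.gov",
    "pandasrvdev.usatlas.bnl.gov",   # F5 virtual service
]

for server in SERVERS:
    for run in range(4):
        start = time.time()
        subprocess.run(
            ["./testJobSubmit.py", "-s", server, "-n", "50"],
            check=True,
        )
        print(f"{server} run {run}: real {time.time() - start:.3f}s")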

Results of Job Dispatch from the PanDA Servers

time ./testJobDispatch.py -s gridui04.usatlas.bnl.gov -n 50
  real 0m15.284s  user 0m2.219s  sys 0m0.494s
  real 0m15.314s  user 0m2.224s  sys 0m0.522s
  real 0m16.533s  user 0m2.223s  sys 0m0.495s
  real 0m15.249s  user 0m2.234s  sys 0m0.513s

time ./testJobDispatch.py -s gridui08.usatlas.bnl.gov -n 50
  real 0m14.947s  user 0m2.234s  sys 0m0.512s
  real 0m14.558s  user 0m2.237s  sys 0m0.487s
  real 0m1.095s  user 0m0.225s  sys 0m0.182s
  real 0m1.095s  user 0m0.225s  sys 0m0.182s

time ./testJobDispatch.py -s pandasrvdev.usatlas.bnl.gov -n 50
  real 0m46.099s  user 0m2.237s  sys 0m0.502s
  real 0m28.716s  user 0m2.216s  sys 0m0.494s
  real 0m14.972s  user 0m2.234s  sys 0m0.491s
  real 0m14.851s  user 0m2.235s  sys 0m0.491s

Conclusions
- Functionality:
  - PanDA servers: job submission; job dispatching to pilots (in progress).
  - PanDA monitoring: users are able to load the URL from the VIP, the canonical name, and the physical IP addresses.
- Reliability:
  - F5 dispatched load according to the configured ratio (4:1) when both services were up.
  - Turning off one service/server is transparent to users.
  - Both servers see a one-second delay because F5 probes the backend servers every second.
  - Job submission to the PanDA server.
  - Pilots pull jobs from the PanDA server.

Conclusions
- Performance enhancement:
  - PanDA monitoring server performance test on the F5 switch, compared with a standalone server (virtual server vs. standalone servers): about 10% higher success rate for non-cached queries, and roughly 20%-50% performance improvement.
  - PanDA server performance test on the F5 switch, compared with a standalone server (virtual server vs. standalone servers): no boost in scalability; we suspect the database is the slow link.
  - We observed a delay of several seconds when submitting jobs for the first time after initialization.

Conclusions
- F5 provides great flexibility for a seamless migration from stand-alone services to highly redundant, reliable services.
- F5 can manage services individually, or group multiple services into a single group and provide a higher-level service.

Problems Discovered and Fixed
- Closer interaction with the PAS group led to problems being identified, solutions proposed, and fixes applied:
  - Firewall conduits opened between the .54 subnets and the .185 subnets (PanDA monitoring and dCache) to allow the PanDA logger to pull files from dCache.
  - PanDA uses the DQ2 clients on CERN AFS.
  - PanDA monitoring servers slowed down when responding to a large number of parallel requests:
    - Code optimization.
    - Adjusted the number of httpd threads and internal caching.
  - The database experienced load spikes during the tests:
    - DBAs worked with developers to optimize slow queries.

Future Work for OSG
- Work with Maxim/Torre to develop more robust, VO-independent test suites for the PanDA monitoring server and the PanDA server.
- Add more VOs to the PanDA testbed and allow other VOs to do job submission and dispatch.
- PanDA server and PanDA monitor refactoring:
  - A clear tagging/version convention.
  - Modular design.
  - Clearly defined branches for each PanDA component.
  - Easy to deploy (RPM based) by VO administrators.
- Improved documentation for PanDA operation and maintenance.