A Scalable and Resilient PanDA Service for Open Science Grid
Dantong Yu
Grid Group, RHIC and US ATLAS Computing Facility


Motivation and Outline
• Project goal: build a scalable and resilient PanDA service for OSG.
   - Support multiple OSG VOs and thousands of VO users and jobs.
   - Reliable, scalable, and high-performance.
   - Cost-effective and flexible deployment.
• Optional solutions for other sites interested in deploying PanDA systems:
   - Software-based reliability solution (Linux VA).
   - F5 hardware load-balancing switch (BNL's choice).
      - Vendor support: 24x7.
• Focus: evaluate and validate the F5-based solution for the PanDA system in terms of reliability and scalability.
• A joint effort between the Physics Application Group and the RACF Grid Computing Group to exercise every component in the PanDA system.
• In this talk: BNL OSG PanDA architecture, items to be evaluated, and performance results.

Reliable/High Performance OSG/ATLAS Job Management Architecture (PanDA)
[Figure: architecture diagram. Labels: Clients; VIP; Virtual Services; F5 server load balancing switch (rewrites the source and destination addresses in the IP header, IP relay); Physical Servers: PanDA Server, Mnt. Server, AutoPilot; PanDA DB; PanDA Archive.]

Items to Be Evaluated
• Functionality:
   - PanDA servers (job submission and job dispatching to pilots).
   - PanDA monitoring (users are able to load URLs via the VIP, the canonical name, or the physical IP addresses).
• Reliability:
   - Loss of one service/server will be transparent to users.
• Performance enhancement:
   - PanDA monitoring server performance test on the F5 switch, compared with a standalone server (virtual server vs. standalone servers).
   - PanDA server performance test on the F5 switch, compared with a standalone server (virtual server vs. standalone servers).

Composite Results: Transaction Times
• Significant improvement in average response time with the virtual server over the physical servers (3 seconds vs. 13 seconds and 7 seconds).

Composite Results for PanDA Monitoring
• No significant difference between cached and non-cached access.
• Virtual service throughput (in terms of processed requests/second) is much better than the aggregate throughput of the two physical servers (PanDA monitoring is more efficient under less load).
[Figure panels: No Cache, Cached]

Results of Job Submission to PanDA Server
• Job submission persists when any single server goes off-line.
• No apparent boost in scalability, likely due to a database bottleneck.

Results of Job Dispatch from PanDA Servers
• Job dispatching is not affected by a single server failure.
• No apparent boost in scalability here either: high activity on the database server.

Conclusions and Future Work
• Functionality: F5-based virtual services passed all tests.
• Reliability: virtual services survived any single point of hardware failure.
• Performance enhancement:
   - Virtual PanDA monitoring services showed significant performance improvement. Performance scales with the number of physical servers.
   - The PanDA server itself does not show performance improvement.
   - The database appears to be the performance bottleneck.
• F5 provides flexible, seamless migration from stand-alone services to highly redundant, reliable services.
• Future work:
   - Continue to develop more robust test suites for the PanDA monitoring server and the PanDA server (VO-independent test suites).
   - Add more VOs to the PanDA test bed, and allow other VOs to do job submissions and dispatches.

Backup Slides

Outline
• Project responsibilities.
• PanDA system and the proposed architecture.
• Items to be evaluated.
• Testbed and test environments.
• PanDA monitoring performance results.
• PanDA server performance results.
• Conclusions, identified problems, and future work.

Responsibilities
• PanDA system and F5 switch integration (Dantong and John).
• Testbed deployment and maintenance (Torre, Tadashi, John, and Aaron).
• PanDA monitoring functionality, reliability, and performance (John DeStefano, Dantong Yu).
• PanDA server functionality, reliability, and performance (John Hover and Dantong Yu).
• PanDA test suites (Maxim, Torre, Tadashi, John Hover).
• PanDA database configuration (Yuri and Tomasz).
• F5 switch configuration (Frank Burstein).

Testbed Hosts
• PanDA virtual services:
   - Monitoring server: pandamondev.usatlas.bnl.gov:25880 (130.199.54.61)
   - Job submission and dispatch server: pandasrvdev.usatlas.bnl.gov:25880 (130.199.54.61)
• PanDA physical servers:
   - gridui04.usatlas.bnl.gov:25880: dual CPU with Hyper-Threading
   - gridui08.usatlas.bnl.gov:25880: dual CPU with eight cores, E5430 @ 2.66 GHz
• PanDA database servers
• F5 BIG-IP Load Balancer 6400 (dispatch ratio: 4:1 between the two physical servers)
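The 4:1 dispatch ratio listed above can be illustrated with a minimal Python sketch. This is not the F5 BIG-IP algorithm, whose internals the slides do not describe; it only shows what a 4:1 split means over many requests. The function name `ratio_schedule` is hypothetical, and which host received the larger share is an assumption, since the slides do not say.

```python
from collections import Counter
from itertools import cycle


def ratio_schedule(weights):
    """Return an endless dispatch order in which each backend appears
    in proportion to its integer weight."""
    order = []
    for backend, weight in weights.items():
        order.extend([backend] * weight)
    return cycle(order)


# ASSUMPTION: the slides give the ratio (4:1) but not which host gets weight 4.
weights = {"gridui08": 4, "gridui04": 1}

schedule = ratio_schedule(weights)
counts = Counter(next(schedule) for _ in range(1000))
print(counts)
```

Over 1000 requests the sketch sends four of every five requests to the weight-4 host. A production balancer would typically interleave the picks more evenly than this fixed block order.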

Test Environment
• PanDA monitoring server tests
   - Tool: Apache Jakarta JMeter
   - Sequential and concurrent requests: 10 concurrent, 1000 total requests per session
   - URL test strings:
      Cached: http://[host:port]/server/pandamon/query?tp=pilots&accepts=BNL_ATLAS_DDM
      No cache: http://[host:port]/server/pandamon/query?tp=pilots&accepts=BNL_ATLAS_DDM&reload=yes
   - "no-cache" headers were also specified explicitly in test requests.
• PanDA server test
   - Customized PanDA test suites to submit user jobs to PanDA servers and dispatch jobs to pilots.
   - Concurrent requests: submit 50 jobs and request dispatch of 50 jobs.
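The tests themselves were run with JMeter. As a rough illustration of the same measurement (10 concurrent workers, a fixed total number of requests, average and longest request time), here is a minimal Python sketch. The names `run_load_test` and `fake_fetch` are hypothetical, and `fake_fetch` is a stand-in that sleeps instead of contacting a server, so the numbers it prints say nothing about PanDA.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load_test(fetch, total=1000, concurrency=10):
    """Call fetch() `total` times from `concurrency` worker threads.
    Returns (average_ms, longest_ms, failures)."""

    def timed_call(_):
        start = time.perf_counter()
        try:
            fetch()
            ok = True
        except Exception:
            ok = False
        return (time.perf_counter() - start) * 1000.0, ok

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, range(total)))

    durations = [ms for ms, _ in results]
    failures = sum(1 for _, ok in results if not ok)
    return sum(durations) / len(durations), max(durations), failures


def fake_fetch():
    # Stand-in for an HTTP GET of the monitoring URL (assumption: no network here).
    time.sleep(0.001)


avg_ms, longest_ms, failures = run_load_test(fake_fetch, total=50, concurrency=10)
print(f"average {avg_ms:.1f} ms, longest {longest_ms:.1f} ms, failures {failures}")
```

To point the sketch at a real endpoint, `fetch` could wrap `urllib.request.urlopen(url).read()` from the standard library, with the host and port filled in for the server under test.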

Reference Information
• Apache Jakarta JMeter: http://jakarta.apache.org/jmeter/index.html
• PanDA project description: https://www.racf.bnl.gov/experiments/usatlas/gridops/panda-reliability-projects/panda-project-description/view

Results: pandamondev (F5)
• No cache, concurrent:
   - Average request time: 3090 ms
   - Longest request time: 5231 ms
• Cached, concurrent:
   - Average request time: 2681 ms
   - Longest request time: 10674 ms

Results: gridui04
• No cache, concurrent:
   - Average request time: 6472 ms
   - Longest request time: 12427 ms
• Cached, concurrent:
   - Average request time: 6643 ms
   - Longest request time: 13583 ms

Results: gridui08
• No cache, concurrent:
   - Average request time: 13523 ms
   - Longest request time: 26485 ms
• Cached, concurrent:
   - Average request time: 12563 ms
   - Longest request time: 25902 ms
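The average request times reported for the three hosts can be turned into a rough throughput estimate. The Python sketch below assumes that all 10 concurrent clients were busy for the whole run, so that throughput is approximately concurrency divided by average latency (Little's law). That is an assumption made here for illustration; the slides report measured throughput only in plots, and these estimates are not those measurements.

```python
CONCURRENCY = 10  # concurrent requests per session, from the test environment

# Average request times (no cache, concurrent) as reported in the results slides.
avg_ms = {
    "pandamondev (F5)": 3090,
    "gridui04": 6472,
    "gridui08": 13523,
}


def estimated_throughput(avg_latency_ms, concurrency=CONCURRENCY):
    """Estimated requests/second, ASSUMING every client is always busy."""
    return concurrency / (avg_latency_ms / 1000.0)


rates = {host: estimated_throughput(ms) for host, ms in avg_ms.items()}
combined_physical = rates["gridui04"] + rates["gridui08"]

for host, rate in rates.items():
    print(f"{host}: {rate:.2f} requests/s (estimated)")
print(f"two physical servers combined: {combined_physical:.2f} requests/s (estimated)")
```

Under that assumption the virtual service comes out at roughly 3.2 requests/s against about 2.3 requests/s for the two physical servers added together, which is at least consistent with the throughput comparison made in the composite results.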

Results of Job Submission to PanDA Server

time ./testJobSubmit.py -s gridui08.usatlas.bnl.gov -n 50
    real 0m1.057s  user 0m0.214s  sys 0m0.199s
    real 0m1.036s  user 0m0.215s  sys 0m0.206s
    real 0m0.998s  user 0m0.207s  sys 0m0.198s
    real 0m0.956s  user 0m0.216s  sys 0m0.194s

time ./testJobSubmit.py -s gridui04.usatlas.bnl.gov -n 50
    real 0m1.134s  user 0m0.225s  sys 0m0.181s
    real 0m1.095s  user 0m0.225s  sys 0m0.182s
    real 0m1.294s  user 0m0.222s  sys 0m0.193s

time ./testJobSubmit.py -s pandasrvdev.usatlas.bnl.gov -n 50
    real 0m7.252s  user 0m0.224s  sys 0m0.192s
    real 0m1.010s  user 0m0.208s  sys 0m0.204s
    real 0m1.036s  user 0m0.225s  sys 0m0.187s
    real 0m1.066s  user 0m0.227s  sys 0m0.178s
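Repeated `time` runs like the ones above are easier to compare once the `real` values are summarized. The following Python sketch parses the four pandasrvdev submission runs and reports their mean and median. The helper name `real_seconds` is hypothetical, and the regular expression assumes the `real 0m1.057s` format shown in the slides.

```python
import re
from statistics import mean, median

# Matches the wall-clock field of `time` output, e.g. "real 0m7.252s".
REAL_RE = re.compile(r"real\s+(\d+)m([\d.]+)s")


def real_seconds(output):
    """Extract every wall-clock ('real') duration, in seconds."""
    return [int(m) * 60 + float(s) for m, s in REAL_RE.findall(output)]


# The four submission runs against the virtual server, copied from the slide.
vip_submit = """
real 0m7.252s user 0m0.224s sys 0m0.192s
real 0m1.010s user 0m0.208s sys 0m0.204s
real 0m1.036s user 0m0.225s sys 0m0.187s
real 0m1.066s user 0m0.227s sys 0m0.178s
"""

times = real_seconds(vip_submit)
print(f"runs: {times}")
print(f"mean: {mean(times):.3f} s, median: {median(times):.3f} s")
```

The median (about 1.05 s) is much less affected by the slow first run than the mean (about 2.59 s), so it is the more useful figure when comparing the virtual server with the physical hosts.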

Results of Job Dispatch from PanDA Servers

time ./testJobDispatch.py -s gridui04.usatlas.bnl.gov -n 50
    real 0m15.284s  user 0m2.219s  sys 0m0.494s
    real 0m15.314s  user 0m2.224s  sys 0m0.522s
    real 0m16.533s  user 0m2.223s  sys 0m0.495s
    real 0m15.249s  user 0m2.234s  sys 0m0.513s

time ./testJobDispatch.py -s gridui08.usatlas.bnl.gov -n 50
    real 0m14.947s  user 0m2.234s  sys 0m0.512s
    real 0m14.558s  user 0m2.237s  sys 0m0.487s
    real 0m1.095s  user 0m0.225s  sys 0m0.182s
    real 0m1.095s  user 0m0.225s  sys 0m0.182s

time ./testJobDispatch.py -s pandasrvdev.usatlas.bnl.gov -n 50
    real 0m46.099s  user 0m2.237s  sys 0m0.502s
    real 0m28.716s  user 0m2.216s  sys 0m0.494s
    real 0m14.972s  user 0m2.234s  sys 0m0.491s
    real 0m14.851s  user 0m2.235s  sys 0m0.491s

Conclusions
• Functionality:
   - PanDA servers: job submission; job dispatching to pilots (in progress).
   - PanDA monitoring: users are able to load URLs via the VIP, the canonical name, and the physical IP addresses.
• Reliability:
   - F5 dispatched load based on the ratio (4:1) when both services were up.
   - Turning off one service/server is transparent to users.
   - Both servers see a one-second delay because F5 tests the backend servers every second.
   - Job submission to the PanDA server.
   - Pilots pull jobs from the PanDA server.
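The reliability behaviour described above (ratio-based dispatch while both backends are up, periodic backend tests, and transparent failover when one goes down) can be sketched in a few lines of Python. This is an illustration only, not F5's implementation: the class `Pool`, its methods, and the `probe` callback are hypothetical names, and the assignment of weights to hosts is an assumption.

```python
class Pool:
    """Toy load-balancer pool with integer weights and a health flag per backend."""

    def __init__(self, weights):
        self.weights = dict(weights)
        self.healthy = set(weights)
        self._tick = 0

    def run_health_check(self, probe):
        """Probe each backend once; the slides say F5 does this every second."""
        self.healthy = {b for b in self.weights if probe(b)}

    def pick(self):
        """Choose the next backend among the healthy ones, honouring the weights."""
        order = [b for b, w in self.weights.items()
                 if b in self.healthy for _ in range(w)]
        if not order:
            raise RuntimeError("no healthy backend available")
        choice = order[self._tick % len(order)]
        self._tick += 1
        return choice


# ASSUMPTION: which host carries weight 4 is not stated in the slides.
pool = Pool({"gridui08": 4, "gridui04": 1})
before = [pool.pick() for _ in range(5)]

# Simulate gridui08 going off-line; the next health check removes it from rotation.
pool.run_health_check(lambda backend: backend != "gridui08")
after = [pool.pick() for _ in range(5)]

print("both up:      ", before)
print("gridui08 down:", after)
```

Because a failed backend is only noticed at the next probe, requests arriving between the failure and that probe can still be sent to it, which is one plausible reading of the one-second delay noted above.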

Conclusions
• Performance enhancement:
   - PanDA monitoring server performance test on the F5 switch, compared with a standalone server (virtual server vs. standalone servers):
      - About 10% higher success rate for non-cached queries, and roughly 20% to 50% performance improvement.
   - PanDA server performance test on the F5 switch, compared with a standalone server (virtual server vs. standalone servers):
      - No boost in scalability was observed; the database is suspected to be the slow link.
      - A delay of several seconds occurred on the first job submission after initialization.

Conclusions
• F5 provides great flexibility, allowing seamless migration from stand-alone services to highly redundant, reliable services.
• F5 can manage services individually. It can also group multiple services into a single group and provide a high-level service.

Problems Discovered and Fixed
• Closer interaction with the PAS group led to problems being identified, solutions proposed, and fixes applied.
• Firewall conduits between the .54 and .185 subnets (PanDA monitoring and dCache) to allow the PanDA Logger to pull files from dCache.
• PanDA uses the DQ2 clients on CERN AFS.
• PanDA monitoring servers slowed down when responding to a large number of parallel requests.
   - Code optimization.
   - Adjusted the number of httpd threads and internal caching.
• The database experienced load spikes during tests.
   - DBAs worked with developers to optimize slow queries.

Future Work for OSG
• Work with Maxim/Torre to develop more robust test suites for the PanDA monitoring server and the PanDA server (VO-independent test suites).
• Add more VOs to the PanDA testbed and allow other VOs to do job submissions and dispatches.
• PanDA server and PanDA monitor refactoring:
   - A clear tagging/version convention.
   - Modular design.
   - Clearly defined branches for each PanDA component.
   - Easy to deploy (RPM-based) by VO administrators.
• Improved documentation for PanDA operation and maintenance.
