Panda: US ATLAS Production and Distributed Analysis System
Xin Zhao, Brookhaven National Laboratory
Outline
- Panda background
- Design and key features
- Core components
  - Panda Server
  - DDM in Panda
  - JobScheduler and Pilots
- Panda in production
- Conclusion and more information
What is Panda?
- Panda: Production ANd Distributed Analysis system
- ATLAS prodsys executor for OSG (diagram on next slide)
- "One stop shopping" for all ATLAS users in the U.S.: managed ATLAS production jobs, regional/group/user production jobs, and distributed analysis jobs
[Diagram: ATLAS Prodsys. A central ProdDB feeds three supervisors ("super"), which drive the NorduGrid, LCG, and OSG executors; "Panda" is the OSG executor. Each Grid provides CEs and SE/RLS storage, with the DMS (DQ2) handling data management.]
What is Panda? (cont'd)
- Written in Python; the development team is from BNL, UTA, UC, OU, ANL and LBL, led by Torre Wenaus (BNL) and Kaushik De (UTA)
- Started August 2005: a full redesign based on previous DC2/Rome production experience, to achieve the performance, scalability and ease of operation needed for ATLAS data taking (up to ~100k jobs/day)
- In production since Dec. 2005
- Ambitious development milestones met; still in rapid development
Panda Design and key features
Architecture (diagram on next slide)
Core components:
- Panda Server: job brokerage and dispatcher
- Data Management System (DDM): ATLAS data management services running on top of Grid SEs
- JobScheduler and Pilots: acquisition of Grid CE resources
Key features (the "pull model"):
- Data-driven workflow, tightly integrated with the ATLAS data management system (Don Quijote 2): pull data to the SE of the targeted site
- Late binding of jobs to worker nodes via the "pilot job" scheme: pull the job payload to acquired CE worker nodes
- Data movement (stage-in and stage-out) is decoupled from job processing
Panda Architecture (diagram)
Core components (I): Panda Server
- Apache-based; communication via HTTP/HTTPS (a minimal handler sketch follows below)
- Multi-process; global info kept in a memory-resident database
- No dependence on special grid middleware
[Diagram: clients (job submitter, pilot, DQ2 callback, monitor) talk to Apache (mod_python) child processes over HTTP/HTTPS; each child process runs a Python interpreter that accesses job info etc. in the DB via the MySQL API and interacts with DQ2.]
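As a rough illustration of the server model on this slide (Apache child processes running mod_python, job state in MySQL, clients connecting over HTTP/HTTPS), here is a minimal sketch in the style of a mod_python publisher function. The function name, form fields, table schema and the use of MySQLdb/JSON are assumptions for illustration, not the actual Panda server API.

```python
# Minimal sketch, not the real Panda server code: a mod_python publisher-style
# entry point that an HTTP(S) client (job submitter, pilot, DQ2 callback)
# could call, with job state kept in a MySQL database.
import json
import MySQLdb  # the "MySQL API" box in the diagram


def _connect():
    # Each Apache child process holds its own handle to the (memory-resident)
    # job database; schema and credentials here are hypothetical.
    return MySQLdb.connect(host="localhost", db="panda_jobs", user="panda")


def submitJob(req, jobSpec=""):
    """Invoked by a submitter client, e.g. POST .../submitJob with a
    serialized job specification in the 'jobSpec' form field."""
    job = json.loads(jobSpec)
    conn = _connect()
    cur = conn.cursor()
    cur.execute(
        "INSERT INTO jobsDefined (jobName, computingSite, jobStatus) "
        "VALUES (%s, %s, 'defined')",
        (job["name"], job.get("site")),
    )
    conn.commit()
    return json.dumps({"status": "OK", "PandaID": cur.lastrowid})
```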
Panda Server (cont'd)
- MySQL cluster backend
  - A memory-resident MySQL database records current/recent job processing activity
  - Longer-term information is stored in a disk-resident DB
- Brokerage
  - Decides where jobs (and their associated data) are sent, based on job characteristics, data locality, priorities, user/group role, and site resources and capacities matched to job needs
  - Dynamic site information gathering via OSG information systems still needs improvement
- Manages the data/workflow pipeline (sketched below)
  - Asks DDM to "dispatch" the dataset associated with a set of jobs to a site
  - On notification that the transfer has completed, releases the jobs
  - On notification that a job has completed, asks DDM to transfer its outputs to the destination
- The dispatcher hands released jobs to sites upon pilot requests
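To make the brokerage and data/workflow pipeline above concrete, here is a heavily simplified sketch of the control flow: choose a site (preferring data locality and free capacity), subscribe the dispatch dataset, activate the jobs on the DQ2 callback, and hand activated jobs to pilots. All names (choose_site, ddm.subscribe, the job states, the db helpers) are illustrative assumptions, not the real Panda server interfaces.

```python
# Simplified sketch of the brokerage / dispatch pipeline; the real server
# logic is far more involved and the names here are assumptions.

def choose_site(job, sites, replica_catalog):
    """Pick a site from job characteristics, data locality and capacity."""
    candidates = [s for s in sites
                  if s.free_slots > 0 and s.satisfies(job.requirements)]
    # Prefer sites that already hold the job's input dataset.
    local = [s for s in candidates if replica_catalog.has(job.input_dataset, s)]
    pool = local or candidates
    return max(pool, key=lambda s: s.free_slots) if pool else None


def broker_job(job, sites, replica_catalog, ddm, db):
    site = choose_site(job, sites, replica_catalog)
    db.set_status(job, "assigned", site=site)
    # Ask DDM to "dispatch" the input dataset to the chosen site; the job
    # waits until DDM calls back to say the transfer is complete.
    ddm.subscribe(job.input_dataset, site)


def on_transfer_complete(dataset, site, db):
    # DQ2 callback: the dispatch dataset has arrived, so release the jobs.
    for job in db.jobs_waiting_for(dataset, site):
        db.set_status(job, "activated")


def on_pilot_request(site, db):
    # Dispatcher: hand one activated job to a pilot asking for work.
    return db.pop_activated_job(site)


def on_job_complete(job, ddm, db):
    db.set_status(job, "finished")
    # Ask DDM to move the outputs to their final destination.
    ddm.subscribe(job.output_dataset, job.destination_site)
```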
Core components (II): DDM
- Don Quijote 2 (DQ2) is the data management system for ATLAS
  - Supports all three ATLAS Grid flavors (OSG, LCG and NorduGrid)
  - Supports all file-based data (event data, conditions, ...)
  - Manages all data flows: EF -> T0 -> Grid Tiers -> institutes -> laptops and end users
- DQ2 architecture (diagram on next slide)
  - Central catalog services
  - Local site services
DQ2 Architecture (diagram)
DDM (cont'd)
Principal features:
- Bulk operations supported: data movement and lookup are done on "datasets" and "datablocks"
  - Dataset: a collection of logical files
  - Datablock: an immutable dataset, specifically designed for replication and global data discovery
- Data movement is triggered by "subscription" (see the sketch below)
  - A client subscription specifies which data are to be transferred to which sites
  - The DQ2 local site service finds subscriptions and "pulls" the requested data to the local site
- Scalable global data discovery and access via a catalog hierarchy; physical file information is available and managed locally only
- Uses GSI authentication; supports SRM (v1.1), GridFTP, gLite FTS, HTTP, and cp
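A toy sketch of the subscription mechanism described above: a subscription is registered centrally, and the DQ2 local site service periodically looks for subscriptions addressed to its site and pulls any files it does not yet hold. Class and method names (CentralCatalog, LocalSiteService, copy_to_local_se) are illustrative assumptions, not the real DQ2 interfaces.

```python
# Subscription-driven "pull" data movement, sketched with assumed interfaces.
import time


class CentralCatalog:
    def __init__(self):
        self.subscriptions = []   # list of (dataset, site) pairs
        self.datasets = {}        # dataset -> list of logical file names

    def subscribe(self, dataset, site):
        self.subscriptions.append((dataset, site))


class LocalSiteService:
    def __init__(self, site, catalog, local_replica_catalog, transfer_tool):
        self.site = site
        self.catalog = catalog
        self.lrc = local_replica_catalog   # physical file info is local-only
        self.transfer = transfer_tool      # e.g. an srmcp/GridFTP wrapper

    def poll_once(self):
        for dataset, site in self.catalog.subscriptions:
            if site != self.site:
                continue
            for lfn in self.catalog.datasets.get(dataset, []):
                if not self.lrc.has(lfn):
                    # Pull the missing file into the local SE and register it.
                    pfn = self.transfer.copy_to_local_se(lfn)
                    self.lrc.register(lfn, pfn)

    def run(self, interval=60):
        while True:
            self.poll_once()
            time.sleep(interval)
```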
DDM (cont'd)
DDM in Panda (diagram on next slide):
- Data management is done asynchronously with job processing
  - Decouples data movement issues from job failures: the lack of reliable, robust data movement/storage services was one of the major causes of job failures in previous Data Challenges
  - Allows "just in time" job launch and execution, good for latency-sensitive jobs, e.g. distributed analysis
- Issue: a DQ2 site service is a prerequisite for a site to run Panda jobs
  - The service is hard to install and involves many manual steps by the site admin; right now only US ATLAS sites have it
  - Later: use an OSG edge service box, or have one site service serve several sites
DDM (cont'd)
[Diagram: DQ2-based data handling in Panda. The Panda server's broker places a subscription with the DQ2 datasets catalog service; the dispatcher is triggered by the DQ2 callback. DQ2 site services at each site (Site A, Site B, e.g. BNL) manage subscriptions against their local catalogs and storage (CE, SE, dCache, HPSS) and transfer outputs to the destination.]
Core components (III): JobScheduler and Pilots
Panda's interface with Grid CE resources:
- Acquisition of CPU slots is pre-scheduled, before the job payload is available
- JobScheduler: sends "pilots" to sites constantly via Condor-G
- Pilot: "pulls" the job payload from the Panda server onto the CPU (minimal sketch below)
  - A "CPU slot" holder from the Panda server's point of view
  - An ordinary batch job for the local batch system
  - A sandbox for one or more real ATLAS jobs
- Workflow (diagram on next slide)
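A minimal sketch (in modern Python) of the pilot behaviour listed above: it occupies a batch slot, asks the Panda server over HTTPS for a payload, runs it if one exists, and exits immediately otherwise. The server URL, endpoint names, payload fields and site name are hypothetical.

```python
# Minimal pilot sketch, assuming a hypothetical server URL and job format.
import json
import subprocess
import sys
import urllib.parse
import urllib.request

PANDA_SERVER = "https://pandaserver.example.org:25443/server/panda"  # hypothetical


def call_server(endpoint, **fields):
    data = urllib.parse.urlencode(fields).encode()
    with urllib.request.urlopen(PANDA_SERVER + endpoint, data) as resp:
        body = resp.read()
    return json.loads(body) if body else None


def main(site):
    # "Pull" a job payload into this CPU slot.
    job = call_server("/getJob", siteName=site)
    if not job:
        sys.exit(0)   # no payload available: release the slot immediately
    # The pilot acts as a sandbox around the real ATLAS job.
    rc = subprocess.call(job["command"], shell=True)
    call_server("/updateJob", jobID=job["PandaID"],
                state="finished" if rc == 0 else "failed")


if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else "EXAMPLE_SITE")  # hypothetical site
```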
Workflow of Panda jobscheduler and pilot
[Diagram: the scheduler's submit process uses OSG site information and submits pilots to the Grid through the Condor schedd (Condor-G). A pilot job running on a remote site then: asks the Panda job dispatcher for a new job (https); sets up the runtime environment; stages input from the local DDM site service to its workdir (srmcp, dccp); forks the real job and monitors it; stages output from the workdir back to local DDM; and sends the final status update to the dispatcher.]
JobScheduler Implementation
- Condor-G based
  - The common "Grid scheduler" today
  - Comes with useful features, e.g. the GridMonitor, which reduces load on the gatekeeper and has become standard practice on OSG
- Infinite loop sends pilots to all usable sites at a steady rate (like the Condor GridExerciser application); a sketch follows below
  - Currently submits two types of pilots: production pilots and user analysis pilots
  - The submission rate is configurable, currently 5 pilots per 3 minutes, and always keeps 30 queued pilot jobs at each remote site
- Static CE site information, e.g. $APP, $DATA..., collected from OSG information services
  - To be automated later, with a single site information database shared across all Panda components (see the Panda Server brokerage slide)
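The sketch below illustrates the steady-rate submission loop described on this slide: keep roughly 30 idle pilots queued per site, submitting at most 5 new pilots per site per 3-minute cycle through Condor-G. The gatekeeper name, the +PandaSite attribute and the way idle pilots are counted are illustrative assumptions; the real JobScheduler is more elaborate.

```python
# Throttled Condor-G pilot submission sketch (assumed names and paths).
import subprocess
import tempfile
import time

SUBMIT_RATE = 5          # at most 5 new pilots per site per cycle
CYCLE_SECONDS = 180      # one cycle every 3 minutes
TARGET_QUEUED = 30       # idle pilots to keep queued at each site

SUBMIT_TEMPLATE = """\
universe      = grid
grid_resource = gt2 {gatekeeper}/jobmanager-condor
executable    = pilot.py
arguments     = {site}
+PandaSite    = "{site}"
output        = pilot.$(Cluster).$(Process).out
error         = pilot.$(Cluster).$(Process).err
log           = pilot.log
queue {count}
"""


def idle_pilots(site):
    # Rough count of idle pilots tagged for this site (real logic would
    # parse condor_q output more carefully).
    out = subprocess.run(
        ["condor_q", "-constraint", f'PandaSite == "{site}" && JobStatus == 1'],
        capture_output=True, text=True).stdout
    return out.count("pilot.py")


def submit_pilots(site, gatekeeper, count):
    with tempfile.NamedTemporaryFile("w", suffix=".sub", delete=False) as f:
        f.write(SUBMIT_TEMPLATE.format(gatekeeper=gatekeeper, site=site, count=count))
        subfile = f.name
    subprocess.run(["condor_submit", subfile], check=True)


def main(sites):
    while True:                               # infinite loop, steady rate
        for site, gatekeeper in sites.items():
            deficit = TARGET_QUEUED - idle_pilots(site)
            if deficit > 0:
                submit_pilots(site, gatekeeper, min(deficit, SUBMIT_RATE))
        time.sleep(CYCLE_SECONDS)


if __name__ == "__main__":
    main({"EXAMPLE_SITE": "gridgk.example.edu"})  # hypothetical site/gatekeeper
```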
JobScheduler (cont’d)
Scalability Issue
- Goal: run ~100k jobs/day; support 2000 jobs (running plus pending) at any time
- From the Condor-G developers: supporting 500~1000 jobs (running plus pending) per site from one submit host is doable
  - US ATLAS production has never reached that range with the currently available CE resources
- Multiple submit hosts: extra operational manpower
- Local submission as a fallback: can bypass the whole cross-domain issue, but needs more involvement from site administrators and is difficult for shifters to maintain remotely, unless OSG provides edge service boxes
Pilot Implementation
- Connects to the Panda server through https for retrieving new jobs and updating job status
  - Outbound connectivity from the worker nodes (or a proxy server) is required on the CE
- Sets up the job work directory on worker-node local disk (OSG $WNTMP)
- Calls the DDM local site service for staging data in and out between worker-node local disk and the CE local storage system
- Lease-based fault-tolerance algorithm: job heartbeat messages are sent to the Panda server every 30 minutes; the server fails a job and re-submits it if there are no updates in 6 hours (sketched below)
- Debugging and logging: a tarball of the workdir is saved into DDM for all jobs (finished and failed)
- Does not consume CPU if no real job is available (exits from the worker node immediately)
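A small sketch of the lease-based fault tolerance described above: the pilot side sends a heartbeat every 30 minutes while the payload runs, and a server-side check fails and re-queues any running job that has been silent for more than 6 hours. The send_update and db helpers are assumed, not real Panda interfaces.

```python
# Lease-based fault tolerance sketch (assumed helpers for server/DB access).
import time

HEARTBEAT_INTERVAL = 30 * 60      # pilot: heartbeat every 30 minutes
LEASE_TIMEOUT = 6 * 60 * 60       # server: fail the job after 6 silent hours


# --- pilot side ---------------------------------------------------------
def heartbeat_loop(job_id, send_update, stop_event):
    """Run in a background thread while the real job executes; send_update
    would be an https call to the Panda server (assumed helper)."""
    while not stop_event.wait(HEARTBEAT_INTERVAL):
        send_update(job_id, state="running")


# --- server side --------------------------------------------------------
def expire_silent_jobs(db, now=None):
    """Fail and re-queue jobs whose lease (last heartbeat) has expired."""
    now = now or time.time()
    for job in db.jobs_in_state("running"):
        if now - job.last_heartbeat > LEASE_TIMEOUT:
            db.set_status(job, "failed", reason="lost heartbeat")
            db.clone_for_retry(job)   # re-submit a fresh attempt
```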
General issues in the JobScheduler and pilot scheme
Security and authentication concerns:
- Late binding of the job payload to the worker node confuses site accounting/auditing
  - The JobScheduler operator's certificate is used for authentication/authorization with remote Grid sites, since the scheduler pushes the pilots into the CEs
  - Real submitters' jobs go directly to pilots without passing through site authentication/authorization with their own certificates
- Possible remedy: forward the certificate/identity so that worker-node process identities can be switched to match the real users
- Collaborate with CMS/CDF and other OSG groups
General issues (cont’d)
Latency-sensitive DA (distributed analysis) jobs:
- Panda reduces latency by bypassing the usual obstacles in acquiring SE/CE resources
  - Pre-stages input data into CE local storage using DQ2
  - Late-binds the job payload to CPU slots (pilots)
- Allocation of pilot slots between production and DA jobs (a sketch follows below)
  - Currently 10% of the pilots are allocated to DA users by the JobScheduler
  - No guarantee of a "steady, adequate" DA pilot stream to the Panda server: this is a "soft" allocation, controlled at the Panda level, not at the batch-system level on the CEs
  - The problem appears when long-running production jobs (walltime ~2 days) occupy all available CPUs: no new pilot requests arrive at all, so any pre-allocation of slots has no effect
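A sketch of the "soft" 10% allocation mentioned above: when deciding what kind of pilot to submit (or which job type to serve), the scheduler simply picks the analysis type with 10% probability. The names are assumptions; as the slide notes, this only helps while pilots are actually being requested and started.

```python
# Soft allocation of pilot types at the Panda level (illustrative only).
import random

DA_FRACTION = 0.10   # fraction of pilots reserved for distributed analysis


def choose_pilot_type(rng=random):
    return "analysis" if rng.random() < DA_FRACTION else "production"


# Example: the mix over 1000 pilot submissions.
if __name__ == "__main__":
    counts = {"analysis": 0, "production": 0}
    for _ in range(1000):
        counts[choose_pilot_type()] += 1
    print(counts)   # roughly {'analysis': 100, 'production': 900}
```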
Issues (cont’d): DA pilots delivery
Alternative approach I: short queue
- DA jobs are short (~1 hour at most); this is why they can ask for low latencies
- Dedicated "short" queue for DA jobs
  - Traditional batch-system model, like an "HOV" lane on the highway
  - Average interval between DA pilot start opportunities is roughly <job walltime> / <# of CPUs> (worked example below)
- Drawbacks:
  - The "short" queue could sit idle if there are not enough DA jobs to run
  - Could it be shared with other VOs' jobs? Policy could change from site to site
  - Requires a CE site configuration change, so it would be deployed first on ATLAS-owned resources
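A worked example of the latency estimate above, with assumed numbers (the slides do not give a short-queue size): 50 dedicated CPUs running ~1-hour DA jobs free a slot roughly every 72 seconds.

```python
# Assumed numbers, for illustration only.
da_walltime_s = 60 * 60      # ~1 hour DA job
short_queue_cpus = 50        # hypothetical dedicated short-queue size

print(da_walltime_s / short_queue_cpus)   # -> 72.0 seconds between slot openings
```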
DA Pilot Delivery (cont’d)
Alternative approach II: multitasking pilot
- Runs both a long production job and a short analysis job in parallel
- Asks for new analysis jobs one after another until the production job finishes, then releases the CPU resource (see the sketch below)
- The production job could be checkpointed and suspended, but that is not the initial approach
[Diagram: the pilot forks/execs the production job and a succession of analysis jobs; a monitor thread retrieves/updates job info and status with the Panda server and performs cleanup.]
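A sketch of the multitasking pilot above: start the long production payload and, while it is still running, keep fetching and running short analysis jobs one after another; when the production job ends, report it and release the slot. get_analysis_job and report are assumed helpers (e.g. https calls to the Panda server), and the memory-contention caveats on the next slide still apply.

```python
# Multitasking pilot sketch (assumed job format and server helpers).
import subprocess
import time


def run_multitasking_pilot(prod_job, get_analysis_job, report):
    # The long production job runs for the whole lifetime of the pilot.
    prod = subprocess.Popen(prod_job["command"], shell=True)

    # "Virtual short queue": keep serving short analysis jobs in parallel.
    while prod.poll() is None:
        anal = get_analysis_job()          # ask the Panda server for a DA job
        if anal is None:
            time.sleep(60)                 # nothing to run right now
            continue
        rc = subprocess.call(anal["command"], shell=True)
        report(anal["PandaID"], "finished" if rc == 0 else "failed")

    # Production job finished: report it and release the CPU resource.
    report(prod_job["PandaID"], "finished" if prod.returncode == 0 else "failed")
```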
Multitasking pilot (cont’d)
Alternative approach II: multitasking pilot
- A "virtual short queue" inside the pilot pool
  - Very short latency, always ready to pick up DA jobs
  - No special configuration required on the CE
- But there are still concerns:
  - Resource contention on worker nodes, particularly memory: ATLAS jobs usually consume a lot of memory (hundreds of MB)
  - Conflicts with resource usage policies; breaks the "fair" sharing of local batch systems with other users' jobs
- To be tried first on ATLAS-owned resources
General issues (cont’d)
- We are actively testing both the "short queue" and "multitasking pilot" approaches on US ATLAS sites, with performance comparison and cost evaluation
- Collaborating with Condor and CMS on "just-in-time workload management"
  - Exploring common ground in the context of OSG's planned program of "middleware extensions" development
  - Condor already has many of the needed functionalities in place, e.g. "multitasking pilot" functionality through its VM system and the Master-Worker framework for low-latency job dispatch
- Extending/generalizing Panda into a Condor-enabled generic WM system, deployed to OSG?
  - A SciDAC-II proposal has been submitted for this, together with the Condor team and the US CMS group
Panda in production
- Steady utilization of ~500 US ATLAS CPUs for months (as long as jobs are available)
- Reached >9000 jobs/day in a brief "scaling test", ~4 times the previous level; no scaling limit found
- Lowest failure rate among all ATLAS executors (<10%)
- Half the shift manpower compared to the previous system
Conclusion
- A newly designed and implemented distributed production/analysis system for US ATLAS is now in operation
- Designed as "one stop shopping" distributed processing for US ATLAS, also interfaced to ATLAS production
- Based on an internally managed queue/brokerage, pilot jobs and "just-in-time" workload delivery
- Closely aligned with ATLAS DDM
- Shows a dramatic increase in throughput/scalability and a decrease in operations workload
- Analysis systems are in place, but latencies need to be improved
- May spawn a more generic effort in collaboration with Condor, CMS and other OSG groups
More information
- Panda
- Panda monitor/browser
- ATLAS DDM (DQ2)
Thanks to the Condor team for their constant, prompt responses and assistance with our system tuning, troubleshooting, and new-feature discussions!