PanDA Status Report Kaushik De Univ. of Texas at Arlington ANSE Meeting, Nashville May 13, 2014.

Overview  We are nearing end of ANSE project ~6 months  Review goals/scope of PanDA work in ANSE  Assess progress so far  PanDA work started ~1 year ago  Plans for completion of current work  Plans for new work  Discuss tomorrow  Synergy with other projects  Artem is co-funded by DOE-ASCR BigPanDA project  BigPanDA continues for ~9 months after ANSE ends  What happens after 2015? May 13, 2014Kaushik De2

PanDA Goals
- Explicit integration of networking with PanDA
  - Never before attempted for any WMS
  - PanDA has many implicit assumptions about networking
- Goal 1: Use network information directly in the PanDA workflow
- Goal 2: Attempt direct control (provisioning) through PanDA
- ANSE + DOE-ASCR
  - Picked a few well-defined topics
  - Set up infrastructure and interactions with other projects
  - Develop and deploy software
  - Evaluation metrics
  - Deliver new capabilities for the LHC experiments
- This is not only R&D – it will be used in a production environment

PanDA Steps
- Collect network information
- Storage and access
- Using network information
- Using dynamic circuits

Sources of Network Information
- DDM sonar measurements
  - Actual transfer rates for files between all sites (Tier 1 and Tier 2)
  - This information is normally used for site white/blacklisting
  - Measurements are available for small, medium, and large files
- perfSONAR (PS) measurements
  - perfSONAR provides dedicated network monitoring data
  - All WLCG sites are being instrumented with PS boxes
  - US sites are already instrumented and monitored
- Federated XRootD (FAX) measurements
  - Read times of remote files are measured for pairs of sites
- This is not an exhaustive list – just a starting point (a sketch of combining these sources follows below)
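To make the idea concrete, here is a minimal, purely illustrative sketch of merging the three sources into one per-site-pair table. All field names, units, and the merging scheme are assumptions made for this example; they are not the actual ATLAS/PanDA schema.

```python
# Illustrative only: merge DDM sonar, perfSONAR, and FAX measurements into a
# single per-site-pair record. Field names and units are hypothetical.

def combine_network_metrics(sonar, perfsonar, fax):
    """Each argument maps (source_site, dest_site) -> a dict of measurements."""
    combined = {}
    for pair in set(sonar) | set(perfsonar) | set(fax):
        combined[pair] = {
            # DDM sonar: measured transfer rate (MB/s) for large test files
            "sonar_mbps": sonar.get(pair, {}).get("large_file_rate"),
            # perfSONAR: packet loss and throughput between PS boxes
            "ps_loss": perfsonar.get(pair, {}).get("packet_loss"),
            "ps_mbps": perfsonar.get(pair, {}).get("throughput"),
            # FAX: average remote read time for this site pair
            "fax_read_s": fax.get(pair, {}).get("read_time"),
        }
    return combined
```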

DDM Sonar

perfSONAR

FAX

Data Repositories
- Three levels of data storage and access
- Native data repositories
  - Historical data stored from the collectors
  - SSB – Site Status Board, for sonar and perfSONAR data
  - FAX data is kept independently and uploaded
- AGIS (ATLAS Grid Information System)
  - Most recent / processed data only – updated periodically
  - Mixture of push/pull – moving to a JSON API (pushed only); a sketch of the pull pattern follows below
- schedConfigDB
  - Internal Oracle DB used by PanDA for fast access
  - Uses the standard ATLAS collector
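As a rough illustration of the pull side mentioned above, the sketch below fetches the most recent processed metrics as JSON. The URL and the JSON layout are placeholders invented for the example; they are not the real AGIS interface.

```python
# Placeholder sketch: pull the latest processed network metrics as JSON.
# The URL and document structure below are made up for illustration.
import json
import urllib.request

AGIS_NETWORK_URL = "https://example.cern.ch/agis/network_metrics.json"  # hypothetical

def fetch_recent_metrics(url=AGIS_NETWORK_URL):
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    # e.g. {"BNL-ATLAS": {"MWT2": {"sonar_mbps": 85.2, "fax_read_s": 3.1}}}
    return data
```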


Using Network Information
- Pick a few use cases
  - Important to PanDA users
  - Enhance workload management through use of the network
  - Should provide clear metrics for success/failure
- Case 1: Improve the user analysis workflow
- Case 2: Improve the Tier 1 to Tier 2 workflow

Improving User Analysis
- In PanDA, user jobs go to the data
  - Typically, user jobs are I/O intensive – hence jobs are constrained to the data
  - Note: almost any user payload is allowed by PanDA
  - User analysis jobs are routed automatically to T1/T2 sites
- For popular data, bottlenecks develop
  - If data is only at a few sites, user jobs have long wait times
- PD2P was implemented 3 years ago to solve this problem
  - Additional copies are made asynchronously by PanDA
  - Waiting jobs are automatically re-brokered to the new sites
  - But bottlenecks still take time to clear up
- Can we do something else using network information?
  - Why not use FAX?
  - First we need to develop network metrics for efficient use of FAX

Faster User Analysis through FAX
- First use case for network integration with PanDA
- PanDA brokerage will use the concept of 'nearby' sites
  - Calculate a weight based on the usual brokerage criteria (availability of CPU, release, pilot rate, …)
  - Add the network transfer cost to the brokerage weight
- Jobs will be sent to the site with the best weight – not necessarily the site holding the data locally
  - If a nearby site has a shorter wait time, access the data through FAX (see the sketch below)
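A minimal sketch of that weighting idea follows. The function name, the way the network term scales the weight, and the 100 MB/s normalization are assumptions made for illustration; this is not the actual PanDA brokerage code.

```python
# Illustrative sketch: fold a network transfer cost into the brokerage weight.
# Names, thresholds, and the scaling are hypothetical, not the PanDA implementation.

def site_weight(base_weight, has_local_data, transfer_rate_mbps,
                remote_penalty=0.5):
    """base_weight comes from the usual criteria (CPU availability, release,
    pilot rate, ...); transfer_rate_mbps is the measured rate from the 'nearby'
    site that would serve the data over FAX."""
    if has_local_data:
        return base_weight
    if not transfer_rate_mbps or transfer_rate_mbps <= 0:
        return 0.0  # no usable network measurement: do not broker remotely
    # Penalize remote access, less so for faster links (capped at 100 MB/s here).
    return base_weight * remote_penalty * min(1.0, transfer_rate_mbps / 100.0)

# The job then goes to the site with the highest weight; if that site does not
# hold the data locally, the payload reads it through FAX.
```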

First Tests
- Tested in production for ~1 day in March 2014
  - Useful for debugging and tuning the direct-access infrastructure
  - We got the first results on network-aware brokerage
- Job distribution
  - 4748 jobs from 20 user tasks that required data from the congested U.S. Tier 1 site were automatically brokered to U.S. Tier 1/2 sites

Brokerage Results

Conclusions for Case 1
- Network data collection is working well
  - Additional algorithms to combine network data will be tried
  - HC (HammerCloud) tests are working well – but PS data is not robust yet
- PanDA brokerage worked well
  - Achieved the goal of reducing wait time
  - Well-balanced local vs. remote access
  - Will fine-tune after more data on performance
  - Waiting for the final implementation
- But we have no data on the actual performance of successful jobs
  - Need to test and validate sites for this mode of data access
  - First tests in March had a 100% failure rate (related to FAX deployment)
  - A second test 1 week ago also did not go well
  - Expect a third test soon

Managing Data Rates
- Tests have shown that direct-access rates need to be managed
- Parameters for WAN throttling have been implemented in PanDA
  - Throttling at the brokerage level is easy (e.g. the ratio of FAX to non-FAX jobs), but does not guarantee throttling during execution
  - Throttling during dispatch is not scalable when a million jobs are dispatched daily (the scale may be higher in the future)
  - Throttling may also be done at the pilot level
- PanDA has implemented a mixed approach to throttling, which is being tested now (sketched below)
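The sketch below illustrates the mixed idea in the simplest possible terms: a coarse budget check at brokerage time and the same check repeated just before the job starts. The per-site budget, the usage counter, and the function names are invented for this example and do not describe the actual PanDA throttling parameters.

```python
# Hypothetical sketch of two-level WAN throttling; all numbers and names are
# illustrative, not the real PanDA parameters.

MAX_WAN_MBPS_PER_SITE = 200.0   # assumed per-site WAN budget for FAX traffic
current_wan_usage = {}          # site -> estimated MB/s used by running FAX jobs

def allow_fax_at_brokerage(site, estimated_job_mbps):
    """Coarse check when the job is brokered: stay under the site's WAN budget."""
    used = current_wan_usage.get(site, 0.0)
    return used + estimated_job_mbps <= MAX_WAN_MBPS_PER_SITE

def allow_fax_at_start(site, estimated_job_mbps):
    """Re-check just before execution (pilot level); usage may have changed
    since brokerage, so the same budget is tested again."""
    return allow_fax_at_brokerage(site, estimated_job_mbps)
```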

Cloud Selection
- Second use case for network integration with PanDA
- Optimize the choice of T1-T2 pairings (cloud selection)
  - In ATLAS, production tasks are assigned to Tier 1s
  - Tier 2s are attached to a Tier 1 cloud for data processing
  - Any T2 may be attached to multiple T1s
- Currently, the operations team makes this assignment manually
  - This could/should be automated using network information
  - For example, each T2 could be assigned to a native cloud by the operations team, and PanDA would assign it to other clouds based on network performance metrics (see the sketch below)
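As a toy illustration of that last bullet, the function below ranks additional clouds for a Tier 2 by measured T1-to-T2 sonar rate, keeping the operations-assigned native cloud first. The data layout, the rate cut, and the number of extra clouds are assumptions for the example.

```python
# Toy sketch of automated cloud selection; data layout and cuts are hypothetical.

def select_clouds(tier2, native_cloud, sonar_rates, n_extra=2, min_mbps=20.0):
    """sonar_rates maps (tier1, tier2) -> measured transfer rate in MB/s."""
    candidates = [(t1, rate) for (t1, t2), rate in sonar_rates.items()
                  if t2 == tier2 and t1 != native_cloud and rate >= min_mbps]
    candidates.sort(key=lambda item: item[1], reverse=True)
    # Native cloud (set by the operations team) always comes first.
    return [native_cloud] + [t1 for t1, _ in candidates[:n_extra]]
```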

DDM Sonar Data

Tier 1 View

More T1 Information

Tier 2 View

Improving Site Association

More T2 Information

Conclusion for Case 2
- Working well in real time
- Currently implementing archival information
  - Keep data for the last 'n' Tier 1 – Tier 2 associations (a sketch follows below)
  - Necessary to check the robustness of the approach
  - The algorithm may use the historical information in the future
- Expect to deploy this summer
  - Hopefully within ~1 month
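A minimal sketch of the archival idea, assuming a simple fixed-length history per Tier 2; the window size and record fields are placeholders, not the actual implementation.

```python
# Assumed sketch: keep only the last N association records per Tier 2 so the
# stability of the T1-T2 choice can be checked over time.
from collections import defaultdict, deque

N_KEEP = 10  # hypothetical window size ('n' on the slide)

association_history = defaultdict(lambda: deque(maxlen=N_KEEP))

def record_association(tier2, tier1, timestamp, rate_mbps):
    """Append one association decision; old entries fall off automatically."""
    association_history[tier2].append(
        {"tier1": tier1, "time": timestamp, "rate_mbps": rate_mbps})
```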

Summary
- The first 2 use cases for network integration with PanDA are working well
- Work will be completed this summer
- Metrics showing the usefulness of the approach will be available in the fall
- On track for a timely final report to ANSE