PanDA in a Federated Environment

Kaushik De, Univ. of Texas at Arlington
CC-IN2P3, Lyon, September 13, 2012

Outline
- Overview
- Concrete plans
  - Federated data access/stage-out for fault tolerance
  - Federated data transfer for managed production
  - Federated data access for distributed analysis
- Speculative ideas
  - Data caching
  - Event caching
  - Cache-aware brokerage

PanDA FAX Status
- Last year, I talked about local federations
  - Direct access through local redirectors is in use by PanDA at SLAC and SouthWest Tier 2 – working well for many years
- This year, the emphasis has been on global federations
  - Global redirectors have been set up and tested in ATLAS
  - Changes were implemented in the PanDA pilot to enable these global redirectors in the default workflow (see the sketch below)
- But progress has been somewhat slow
  - PanDA is under continuous use in ATLAS, so development activities not directly related to LHC data taking have been minimal
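
In a federated (FAX) setup, "direct access" means a job opens the file through a redirector URL in the global namespace and the federation locates a working replica, instead of the pilot first copying the file from the local SE. Below is a minimal sketch of what such an open could look like from a job, assuming PyROOT is available; the redirector hosts, port, and global-namespace path are illustrative assumptions, not the actual ATLAS endpoints.

```python
import ROOT

# Hypothetical redirector endpoints (hostnames/ports are illustrative, not real ATLAS config)
LOCAL_REDIRECTOR  = "root://xrd.local-t2.example.org:1094/"
GLOBAL_REDIRECTOR = "root://fax-global.example.org:1094/"

def open_via_federation(global_name):
    """Try the local redirector first, then fall back to the global one.

    `global_name` is the file's path in the federation's global namespace
    (the exact naming convention is an assumption here).
    """
    for redirector in (LOCAL_REDIRECTOR, GLOBAL_REDIRECTOR):
        url = redirector + global_name
        f = ROOT.TFile.Open(url)          # xrootd handles redirection to a replica
        if f and not f.IsZombie():
            return f
    raise IOError("no replica reachable for %s" % global_name)

# Example (hypothetical global-namespace path):
# f = open_via_federation("/atlas/dq2/mc12/NTUP/some.dataset/file.root")
```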

FAX for Fault Tolerance
- Phase 1 goal
  - If an input file cannot be transferred/accessed from the local SE, the PanDA pilot currently fails the job after a few retries
  - We plan to use federated storage for these (rare) cases
  - Start with file staging/transfers using FAX (see the sketch below)
  - Implemented in a recent release of the pilot, works fine at two test sites
  - Next step – wider-scale testing at production/DA sites
- Phase 2
  - Once file transfers work well, try FAX direct access
- Phase 3
  - Try FAX for transfer of output files, if the default destination fails
- Next few slides from Tadashi/Paul
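
A minimal sketch of the fallback order described in Phase 1: copy the input from the local SE as usual and, only if that fails after a few retries, stage it in over FAX. This is not the actual pilot code; the copy command, redirector endpoint, and helper names are assumptions for illustration.

```python
import subprocess

FAX_REDIRECTOR = "root://fax-global.example.org:1094/"   # hypothetical endpoint
MAX_LOCAL_RETRIES = 3

def copy_from_local_se(surl, dest):
    """Stage-in from the local SE (here via xrdcp; the real pilot uses the site's copy tool)."""
    return subprocess.call(["xrdcp", "-f", surl, dest]) == 0

def copy_via_fax(global_name, dest):
    """Fallback stage-in through the federation's global redirector."""
    return subprocess.call(["xrdcp", "-f", FAX_REDIRECTOR + global_name, dest]) == 0

def stage_in(surl, global_name, dest):
    # Normal path: a few retries against the local SE
    for _ in range(MAX_LOCAL_RETRIES):
        if copy_from_local_se(surl, dest):
            return True
    # Rare case: local SE unavailable or file missing -> try FAX instead of failing the job
    return copy_via_fax(global_name, dest)
```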

(image-only slide – content from Tadashi/Paul not captured in the transcript)

FAX for Managed Production
- Managed production has a well defined workflow
  - PanDA schedules all input/output file transfers through DQ2
  - DQ2 provides a dataset-level callback when transfers are completed
- FAX can provide an alternate transport mechanism
  - Transfers handled by FAX
  - Dataset-level callback provided by FAX
  - Dataset discovery/registration still handled by DQ2
- File-level callbacks
  - Recent development – use ActiveMQ for file-level callbacks (see the sketch below)
  - On a best-effort basis for scalability – dataset callbacks are still used
  - FAX can use the same mechanism
- Work in progress
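
The file-level callbacks mentioned above are lightweight messages published to a broker as each file arrives, while the authoritative dataset-level callback still closes out the transfer. A minimal sketch of publishing such a message with the stomp.py client for ActiveMQ; the broker host, topic name, credentials, and message fields are assumptions, not the actual ATLAS/DDM schema.

```python
import json
import time

import stomp  # stomp.py client for ActiveMQ/STOMP brokers

# Hypothetical broker endpoint and topic (not the actual ATLAS configuration)
BROKER = [("mq.example.org", 61613)]
TOPIC = "/topic/transfers.file_done"

def publish_file_callback(dataset, lfn, site):
    """Best-effort notification that one file of a dataset has arrived at a site."""
    conn = stomp.Connection(BROKER)
    conn.connect("user", "password", wait=True)
    payload = {
        "dataset": dataset,
        "lfn": lfn,
        "site": site,
        "timestamp": int(time.time()),
    }
    conn.send(destination=TOPIC, body=json.dumps(payload))
    conn.disconnect()

# publish_file_callback("mc12.some.dataset", "file_001.root", "MWT2")
```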

FAX for Distributed Analysis
- Most challenging and most rewarding
- Currently, DA jobs are brokered to sites which host the input datasets
  - This may limit and slow the execution of DA jobs
- Use FAX to relax the constraint on locality of data
  - Use a cost metric generated with HammerCloud tests
  - Provides the 'typical cost' of data transfer between two sites
- Brokerage will use 'nearby' sites
  - Calculate a weight based on the usual brokerage criteria (availability of CPUs, …) plus the transfer cost (see the sketch below)
  - Jobs will be dispatched to the site with the best weight – not necessarily the site with local data or available CPUs
- Cost metric already available (see Ilija/Rob talks)
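
A minimal sketch of how a cost-aware brokerage weight could combine CPU availability with the HammerCloud-derived transfer cost; the weight formula and data structures are illustrative assumptions, not the actual PanDA brokerage algorithm.

```python
def site_weight(site, sites_with_data, free_slots, transfer_cost):
    """Toy brokerage weight: more free CPUs is better, expensive remote reads are worse.

    free_slots[site]          -> currently available CPU slots at the site
    transfer_cost[(src, dst)] -> 'typical cost' of reading data stored at src from dst
                                 (e.g. derived from HammerCloud measurements)
    A site holding a replica pays no transfer cost; sites_with_data is assumed non-empty.
    """
    if site in sites_with_data:
        cost = 0.0
    else:
        cost = min(transfer_cost.get((src, site), float("inf")) for src in sites_with_data)
    return free_slots.get(site, 0) / (1.0 + cost)

def choose_site(candidate_sites, sites_with_data, free_slots, transfer_cost):
    # Dispatch to the site with the best weight - not necessarily one holding the data
    return max(candidate_sites,
               key=lambda s: site_weight(s, sites_with_data, free_slots, transfer_cost))
```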

Implementation Schedule
- FAX for fault tolerance
  - Phase 1 (FAX transfers) – done, test for a few months
  - Phase 2 (FAX direct access) – before year end
  - Phase 3 (FAX output) – before year end
- FAX for central production
  - Within 6 months
  - Maybe sooner – ActiveMQ is already under testing
- FAX in brokerage
  - Cost metric already available
  - A few months to set up and test in the PanDA database
  - Next year – enable a few sites for high-throughput tests

Data Caching
- Local data caching for WAN access
  - Maybe not for PanDA – can the federation do it transparently?
  - Various alternatives were discussed in the WAN meeting at CERN
- PanDA could keep a site-level cache (see the sketch below)
  - Not a guaranteed file catalog – a best-effort list
  - Use FAX to fetch a file again if it is no longer available
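
A minimal sketch of the "best-effort list" idea: a site-level cache that never guarantees a file is still present and simply re-fetches it over FAX on a miss. The cache layout, redirector endpoint, and helper names are assumptions for illustration.

```python
import os
import subprocess

FAX_REDIRECTOR = "root://fax-global.example.org:1094/"   # hypothetical endpoint

class BestEffortSiteCache:
    """Site-level cache: a best-effort list of locally staged files, not a guaranteed catalog."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        self.entries = {}   # global file name -> local path (may silently go stale)

    def get(self, global_name):
        path = self.entries.get(global_name)
        if path and os.path.exists(path):
            return path                       # cache hit, file is still there
        # Miss, or the file was evicted/lost: fetch it again over the federation
        path = os.path.join(self.cache_dir, os.path.basename(global_name))
        if subprocess.call(["xrdcp", "-f", FAX_REDIRECTOR + global_name, path]) != 0:
            raise IOError("could not re-fetch %s via FAX" % global_name)
        self.entries[global_name] = path
        return path
```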

Event Cache
- Long-term PanDA goal – an event service
  - The granularity of data processing in PanDA today is datasets and files
  - But events are really the atomic unit for HEP
  - A PanDA event service will change the current processing model
- Challenges of an event service
  - Scalability – keeping track of hundreds of billions of events (see the sketch below)
  - Fault tolerance – processing all events without data loss
  - Chaining of data processing
  - Efficient use of WAN vs. storage
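
One way to keep bookkeeping tractable at the scale of hundreds of billions of events is to track contiguous event ranges rather than individual events, and to re-dispatch any range whose processing was lost. A minimal sketch of that idea follows; it only illustrates the bookkeeping principle and is not the design of the PanDA event service.

```python
class EventRangeTracker:
    """Track processing status per (file, event range) instead of per event."""

    def __init__(self, range_size=1000):
        self.range_size = range_size
        self.pending = {}    # range_id -> (file, first_event, last_event)
        self.done = set()

    def define_ranges(self, file_name, n_events):
        """Split a file's events into fixed-size ranges to be dispatched to workers."""
        for first in range(0, n_events, self.range_size):
            last = min(first + self.range_size, n_events) - 1
            range_id = "%s:%d-%d" % (file_name, first, last)
            self.pending[range_id] = (file_name, first, last)

    def mark_done(self, range_id):
        if range_id in self.pending:
            self.done.add(range_id)
            del self.pending[range_id]

    def ranges_to_redispatch(self):
        # Fault tolerance: anything still pending (e.g. a worker died) can be reassigned,
        # so every event is eventually processed without data loss.
        return list(self.pending.values())
```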

Conclusion
- A wide array of FAX plans for PanDA
- The schedule depends on the availability of effort during the LHC run
- We do not foresee technical challenges for the short/medium term
- Long term – many open ideas, some quite challenging