
CERN IT Department, CH-1211 Genève 23, Switzerland – www.cern.ch/it

Distributed Analysis User Support in ATLAS (and LHCb)
Dan van der Ster, CERN IT-GS & ATLAS
Contributions from Andrew Maier, CERN IT-GS & LHCb
WLCG Workshop – Prague, Czech Republic – Sunday March 22, 2009

First, some numbers…

From A. Maier (LHCb):
– In the last 3 months, close to 200 unique users in LHCb
  – A steadily increasing trend
  – ⅓ of the entire collaboration, and probably close to the total number of physicists expected to be involved in analysis
– More than 60 physicists using Ganga on average per week

For ATLAS, Ganga & Pathena have seen ~500 + ~500 unique users in the past six months:
– ~125 + ~125 unique users per month
– The number of users is still expected to increase

For both experiments, the number of jobs will of course increase, so we are not yet at the peak user support load.

User support models

In LHCb, the model is tutorials + help forum (+ validation):
– A Ganga introduction and hands-on tutorial is part of the LHCb core software training (repeated every 2 months)
  – Additional external software training is organised 2-3 times per year
– User support is through mailing lists:
  – one list covers all aspects of distributed analysis questions, including Ganga
  – another covers specialised Ganga support questions
– Validation with SAM tests

In ATLAS, the model is also tutorials + help forum (+ validation):
– Physics Analysis Workbook: “Running on Large Samples”
– Offline Software Tutorials every ~6 weeks
– A single help forum is the catch-all for both distributed analysis tools (Ganga + Pathena)
– Validation is behind-the-scenes automated functional and stress testing

ATLAS Support Infrastructure

The DA Support Team (DAST) was formed:
– To relieve developers of the support burden
– To support Pathena & Ganga through a single forum (the tools are working toward common source code)
– To maintain documentation and enable users to help themselves

DAST was modeled after the ATLAS production shifts:
– Reused their infrastructure (scheduling + calendar, some procedures)
– We asked the user community for volunteers to become expert shifters
– Started Oct 2008 with 4 NA + 4 EU shifters

Each week, we have 1 NA + 1 EU shifter on shift:
– The third time zone has no coverage
– Shifters are responsible for (a) directly helping users, (b) monitoring the analysis services, and (c) helping with user data management issues

AtlasDAST Responsibilities (1)

1. Provide help via the DA Help Forum.

We see two basic kinds of problem:
– “How do I do X?” → we (create and) forward the user to documentation
– “My analysis doesn't submit / run / complete / produce output!” → view logfiles, check sites, try to reproduce the problem; these take time to solve, or must be escalated to another expert

Volume:
– Oct 1 → March 4: 621 “conversations” (~125 per month, ~4 per day)
– February 2009: 155 conversations (~5.5 per day)

AtlasDAST Responsibilities (2)

2. Help with dataset replication requests:
– Users can request a dataset to be replicated to a new site
– This can't be allowed freely (the model is “jobs to data”, not “data to jobs”)
– If the request is for a >10 GB transfer and the data is already available within the destination cloud, DAST intervenes and helps the user process the data in its present location
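The triage rule above can be sketched as a single predicate. This is a minimal illustration, not DAST tooling; the function and parameter names are our own invention.

```python
THRESHOLD_GB = 10  # requests above this size get extra scrutiny

def needs_dast_intervention(size_gb, replica_clouds, destination_cloud):
    """True if DAST should point the user at an existing replica in the
    destination cloud instead of approving a new transfer."""
    return size_gb > THRESHOLD_GB and destination_cloud in replica_clouds
```

Small requests, or requests to a cloud that genuinely lacks the data, fall through to the normal approval path.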

AtlasDAST Responsibilities (3)

3. Monitor the DA functional tests (GangaRobot in SAM):
– Check the daily tests; file a GGUS ticket when failures occur

GangaRobot runs short analysis jobs on all sites a few times per day:
– It validates the full analysis workflow in one test: Ganga + middleware + Athena + data management
– It tests all ATLAS grids: EGEE, OSG, NorduGrid

GangaRobot automatically disables EGEE sites if they fail a test:
– Ganga avoids the sites “blacklisted” by GangaRobot
– For the other grids, sites are disabled manually
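The auto-exclusion behaviour described above can be sketched as a filter over candidate sites: a site that failed its latest functional test is dropped, but only automatically for EGEE. All names here are hypothetical, not the actual GangaRobot interface.

```python
def usable_sites(all_sites, latest_results, grid_of):
    """latest_results: {site: test_passed}; grid_of: {site: grid name}."""
    auto_blacklisted = {
        site for site, passed in latest_results.items()
        if not passed and grid_of.get(site) == "EGEE"  # only EGEE is automatic
    }
    return [site for site in all_sites if site not in auto_blacklisted]
```

A failing OSG or NorduGrid site survives the filter; per the slide, those are disabled by hand.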

Issue Tracking

DAST is not a help desk:
– Support is via an eGroups forum, to enable user-to-user support

Shifters need a shared interface to label, flag, and privately discuss the various threads/issues:
– RT, Remedy, and Savannah are not appropriate
– We use a shared Gmail account

Gmail Issue Tracking

Gmail works for us, but it isn't perfect. Our procedures for using Gmail consistently:

Gmail Issue Tracking Procedure
1. Open (unresolved) issues are to remain in the Inbox. Closed (resolved) issues are to be archived.
2. Escalated issues are to be considered open until the problem is resolved.
3. Open threads can be in one of 5 states:
   1. Requiring attention: any thread with no labels, or that is starred, or that has an unread reply.
   2. Waiting for user response: label these WAITING, then ignore, and finally remove the label when the user has replied.
   3. Requiring urgent attention: label these URGENT, and act accordingly.
   4. Escalated: label these ESCALATED, and add a "to where" label (see (4)(1)); shifters should inform the user, then close the issue when it is resolved. If an issue is escalated but still unresolved after a reasonable amount of time, we should contact the "to where" for an update.
   5. Fixed in the next release: label these "FIXED in next release", contact the user after the next release to remind/verify that the issue is fixed, and finally close. Use your judgment to decide if the thread can instead be immediately closed (i.e. if you feel that we don't need to follow up after the release).
4. Labels other than those mentioned in (3) can be used for informational purposes, including:
   1. where an issue has been escalated, e.g. Ganga Expert, GGUS, DQ2 Savannah.
   2. temporary labels used to track common issues, e.g. "mc08 dataset problem".
   3. other labels for arbitrary information/tracking purposes.
5. To close an issue, remove any labels mentioned in (3), (4)(1), or (4)(2) and then archive the thread. Labels from (4)(3) can optionally remain on closed issues for later reference.

The above guidelines have the following implications:
1. To find issues needing attention, just browse to the Inbox and look for unlabeled, starred, and unread threads.
2. There will be many items in the Inbox that will not require attention. This is OK.
3. Issues will stay in WAITING until the user responds. I don't think we need to contact users if they are too uninterested to reply. I suggest we close inactive WAITING threads after 7 days.
4. Threads that we close will be automagically reopened if the user or anyone else replies to the thread. Thus, you can safely "close" a thread and it will reopen itself if the user doesn't agree with you. Perhaps this means that the WAITING state is redundant, but at least I find it useful to keep these obviously open threads in view, and thus in mind.
5. If another user happens to resolve an issue without DAST intervention, just archive the thread and move on.
6. If you find a thread in an inconsistent state, try to find out its real status and correct the labels.
7. Feel free to create new labels under (4)(2) or (4)(3); please communicate their meanings to the other shifters if they are to persist, or otherwise delete them at the end of your shift week.
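The procedure above is effectively a small state machine over thread labels. The transition table below is our reading of rules (1)-(5), not code the shifters actually ran; state and event names are hypothetical.

```python
# Thread lifecycle implied by the Gmail procedure: open states map to
# labels; "CLOSED" means archived. Unknown events leave the state alone.
TRANSITIONS = {
    ("NEEDS_ATTENTION", "await_user"): "WAITING",
    ("WAITING", "user_replied"): "NEEDS_ATTENTION",  # unread reply reopens it
    ("WAITING", "inactive_7_days"): "CLOSED",        # implication (3)
    ("NEEDS_ATTENTION", "escalate"): "ESCALATED",
    ("ESCALATED", "resolved"): "CLOSED",
    ("NEEDS_ATTENTION", "fixed_next_release"): "FIXED_IN_NEXT_RELEASE",
    ("FIXED_IN_NEXT_RELEASE", "verified_after_release"): "CLOSED",
    ("CLOSED", "anyone_replied"): "NEEDS_ATTENTION", # implication (4)
}

def step(state, event):
    return TRANSITIONS.get((state, event), state)
```

Implication (4), that archived threads "automagically" reopen on any reply, is what makes the CLOSED → NEEDS_ATTENTION edge safe to rely on.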

Common Issues

1. The usual DA issues:
– Why did my job fail? My job ran yesterday but not today?

2. User support is not just DA support:
– The user workflow is (a) look for input data, (b) run the jobs, (c) retrieve the output data
– We need to support more than Ganga/Pathena (especially the data management tools)

3. Users aren't aware of the very nice monitoring:
– Many users find it more convenient to ask why their job failed than to check what the monitoring is showing

4. Users don't (and might never) know the policies:
– i.e. where they can run, what inputs they can read, where they can store outputs, which storage locations are temporary/permanent, …
– Policies are dynamic and inconsistently implemented

Points 3 & 4 imply that the end-user tools need to:
– fully enforce the policies, and
– be fully integrated with the monitoring, especially by being aware of site downtimes
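The "aware of site downtimes" requirement amounts to filtering candidate sites against published downtime windows before brokering. A minimal sketch, assuming a hypothetical `{site: (start, end)}` downtime feed rather than any real monitoring API:

```python
import datetime as dt

def sites_up(candidates, downtimes, now):
    """Drop sites whose scheduled downtime window covers `now`.

    downtimes: {site: (start, end)} as datetime pairs.
    """
    up = []
    for site in candidates:
        window = downtimes.get(site)
        if window and window[0] <= now <= window[1]:
            continue  # site is in scheduled downtime, skip it
        up.append(site)
    return up
```

Doing this inside the tool means users never see failures that the monitoring already predicted.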

“Behind-the-scenes” Testing

We already saw the automated functional testing: GangaRobot.

We also run large automated stress tests with HammerCloud:
– 74 sites tested; the top sites tested >25 times
– >50000 jobs, with an average runtime of 2.2 hours
– >10.5 million files (>3 billion events)
– Testing different data I/O configurations (e.g. posix I/O vs copy-and-process)
– Also used to evaluate new or changed sites

(Plots: % CPU used and events/second distributions.)