CERN IT Department, CH-1211 Genève 23, Switzerland, www.cern.ch/it. Future Needs of User Support (in ATLAS). Dan van der Ster, CERN IT-GS & ATLAS. WLCG Workshop.

Presentation transcript:

Future Needs of User Support (in ATLAS)
Dan van der Ster, CERN IT-GS & ATLAS
WLCG Workshop, Prague, Czech Republic
Sunday March 22, 2009

User support model

In ATLAS, the model is tutorials + help forum (+ validation):
- Physics Analysis Workbook: “Running on Large Samples”
- Offline Software Tutorials every ~6 weeks
- [...]: catch-all for both distributed analysis tools (Ganga + [...]
- (Validation is behind-the-scenes automated Functional and Stress Testing)

ATLAS Support Infrastructure

DA Support Team (DAST) formed:
- To relieve developers of the support burden
- To support Pathena & Ganga through a single forum (the tools are working toward common source code)
- To maintain documentation and enable users to help themselves

DAST was modeled after the ATLAS production shifts:
- Reused their infrastructure (scheduling + calendar, some procedures)

We asked the user community for volunteers to become expert shifters:
- Started Oct 2008 with 4 NA + 4 EU shifters

Each week, we have 1 NA + 1 EU on shift:
- Third time zone has no coverage
- Shifters are responsible for (a) directly helping users, (b) monitoring the analysis services, and (c) helping with user data management issues

Issue Tracking

DAST is not a help desk:
- Support is via an eGroups forum to enable user2user support

Shifters need a shared interface to label, flag, and privately discuss the various threads/issues:
- RT, Remedy, Savannah are not appropriate
- We use a shared Gmail account

Common Issues

1. Usual DA issues:
   - Why did my job fail? Why did my job run yesterday but not today?
2. User support is not just DA support:
   - The user workflow is (a) look for input data, (b) run the jobs, (c) retrieve the output data
   - We need to support more than Ganga/pathena (especially data management tools)
3. Users aren’t aware of the very nice monitoring:
   - Many users find it more convenient to ask why their job failed than to check what the monitoring is showing
4. Users don’t (and might never) know the policies:
   - e.g. where they can run, what inputs they can read, where they can store outputs, which storage locations are temporary/permanent, …
   - Policies are dynamic and inconsistently implemented

Points 3 and 4 above imply that the end-user tools need to:
- fully enforce the policies, and
- be fully integrated with the monitoring, especially by being aware of site downtimes
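
As an illustration of what "enforce the policies and be aware of site downtimes" could look like inside a submission tool, here is a minimal sketch of a pre-submission check. Everything in it is an assumption made for illustration: the `Policy` and `Downtime` records, the `check_submission` function, and the site names are hypothetical and do not correspond to any Ganga, pathena, or ATLAS API.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Downtime:
    """Hypothetical record of a posted site downtime."""
    site: str
    start: datetime
    end: datetime


@dataclass
class Policy:
    """Hypothetical per-user policy: where the user may run and store output."""
    run_sites: set = field(default_factory=set)
    output_sites: set = field(default_factory=set)


def check_submission(site, output_site, policy, downtimes, now):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if site not in policy.run_sites:
        problems.append(f"You are not allowed to run at {site}.")
    if output_site not in policy.output_sites:
        problems.append(f"You are not allowed to store output at {output_site}.")
    for d in downtimes:
        # A downtime applies if 'now' falls inside its [start, end) window.
        if d.site == site and d.start <= now < d.end:
            problems.append(
                f"{site} is in a posted downtime until {d.end:%Y-%m-%d %H:%M}."
            )
    return problems
```

A tool built this way would refuse the submission and print the returned messages, so the user sees the policy or downtime problem before the job fails and before a question reaches the forum.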

Part II: The Future Needs of User Support

Ideal and Non-ideal Users

How does the ideal user get help with a problem?
1. Read documentation (--help, tutorials, FAQs, mailing list archives)
2. Ask office mates or other colleagues
3. Post to mailing lists or discussion forums
   - (the problem with #2 is that the wider community doesn’t always benefit)

Today, we get many “stupid” questions on the mailing lists:
- “My job/data-transfer failed!” at a site in a posted downtime
- Questions that are already answered in the docs, the FAQ, or yesterday's threads

But are they really stupid? Two opinions:
1. As users get experience, these questions will disappear.
2. There are no stupid questions: they indicate weaknesses in the documentation or end-user tools, and we need to improve:
   - User-oriented monitoring
   - Organization and relevance of the documentation
   - User discussion forums

I think that we are not enabling users to act ideally today.

Monitoring for Users

User-oriented monitoring:
- AFAICT, users are not aware of the monitoring and don’t look at it
- Users look at (a) the job submission tool and (b) the job monitoring tool. Sometimes these two are the same.
- The relevant monitoring info must be obvious from the end-user tools:
  - E.g. a job listed as “activated” should say how long until it will run. If the selected site is now offline, this should be stated.
- Current monitoring is too spread around; users need a single entry point.

“Status Awareness” in the end-user tools:
- Some tools already look at the service statuses:
  - Ex: GangaRobot blacklisting in ATLAS.
- This needs to be implemented in all end-user tools:
  - Ex: data retrieval tools should not try a transfer from a down site
- The information must be updated frequently; sites need to be de-blacklisted ASAP.
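
The two requirements above (do not use a down site, and de-blacklist a recovered site quickly) can be sketched as a status cache with a short maximum age. This is only a sketch under stated assumptions: `SiteStatusCache`, `usable_replicas`, the `fetch` callable, and the `"online"`/`"offline"` strings are hypothetical and are not the GangaRobot interface or any real monitoring API.

```python
import time


class SiteStatusCache:
    """Caches site statuses and re-fetches them once they are too old."""

    def __init__(self, fetch, max_age_s=300, clock=time.monotonic):
        # fetch: hypothetical callable returning {site: "online" | "offline"}
        self._fetch = fetch
        self._max_age_s = max_age_s
        self._clock = clock
        self._statuses = {}
        self._fetched_at = None

    def _refresh_if_stale(self):
        now = self._clock()
        if self._fetched_at is None or now - self._fetched_at >= self._max_age_s:
            # Copy the result so later changes upstream are only seen on refresh.
            self._statuses = dict(self._fetch())
            self._fetched_at = now

    def is_online(self, site):
        self._refresh_if_stale()
        # Assumption: unknown sites are treated as online; a stricter tool
        # might refuse them instead.
        return self._statuses.get(site, "online") == "online"


def usable_replicas(replica_sites, cache):
    """Keep only the replica sites that are currently reported online."""
    return [site for site in replica_sites if cache.is_online(site)]
```

The `max_age_s` value is the trade-off the slide points at: a long cache lifetime reduces load on the monitoring service, but keeps a recovered site blacklisted for longer.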

The Discussion Forum

Users are asking questions that are already covered in documentation or other threads:
- Like the site/service monitoring, the documentation is spread around:
  - Tutorials, Twiki docs, FAQs, --help
- And some documentation does not exist:
  - HOWTO knowledgebase of working job configurations

We also need to enable and encourage user2user support. What would Google do?
- Google Help is a combination of Discussion Forums + Question & Answer
- Forum: same as eGroups, but we need to be able to “stick” (highlight) some important threads
- Q&A: users ask questions, and others post answers. Answers are voted on, and the thread is then marked read-only when the real answer is given.
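
To make the Q&A behaviour concrete (voted answers, sticky threads, and a thread that becomes read-only once the real answer is given), here is a small data-model sketch. The `Question` and `Answer` classes and their methods are hypothetical names chosen for illustration; they do not describe eGroups, Google Help, or any existing forum software.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Answer:
    author: str
    text: str
    votes: int = 0


@dataclass
class Question:
    title: str
    sticky: bool = False              # highlighted ("stuck") by a shifter
    answers: list = field(default_factory=list)
    accepted: Optional[Answer] = None

    def post_answer(self, author, text):
        # Once the real answer has been accepted, the thread is read-only.
        if self.accepted is not None:
            raise PermissionError("thread is read-only: an answer was accepted")
        answer = Answer(author, text)
        self.answers.append(answer)
        return answer

    def accept(self, answer):
        if not any(a is answer for a in self.answers):
            raise ValueError("answer does not belong to this question")
        self.accepted = answer

    def ranked_answers(self):
        # Accepted answer first, then the rest by descending vote count.
        return sorted(
            self.answers,
            key=lambda a: (a is not self.accepted, -a.votes),
        )
```

Ranking the accepted answer first, regardless of votes, is a design choice made here so that a later reader with the same problem sees the resolution before the discussion.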

The Need for Experts

Will we always need expert user shifters?
- Most (physicist) users will never become grid experts
- There will always be only a few who understand the details that cause some job failures:
  - Ex: we can never expect a user to recognize that an athena crash was due to an overloaded storage pool.
- I think that expert user shifters will be needed for the foreseeable future.

In ATLAS, we estimate that we will need to at least double the manpower for DAST shifts:
- From 1+1 to 2 US + 2 EU (+2 Asia)

With more than 2 shifters on at once, communication between shifters (including production shifters) becomes critical:
- Issue tracking: Gmail probably won’t scale with >1 shifter on at once. We need another collaborative support tool.
- Communication with other shifters: experimenting with Virtual Control Room