Lessons Learned: The Organizers

Kinds of Lessons*
- Operational
  - Distributing the code
  - Making the sky data
  - Required compute resources
  - Required people resources
  - Remaking the sky data
  - Distributing the data - DataServers
- Functional
  - Problems extracting livetime history
  - Problems extracting pointing history – SAA entry/exit
- Organizational
  - How/when to draw on expert help for problem solving
  - Sky model
  - Confluence/Workbook
- Analysis
  - Access to standard cuts
  - GTIs
  - Livetime cubes, diffuse response

* Or things to fix for next time

Making the Sky 1
- Code distribution
  - Navid made nice self-installers with wrappers that took care of env vars etc.
  - Creation of distributions is semi-manual. Should find out how to automate – rules based.
- We needed a lot more compute resources than we anticipated (a rough sizing sketch follows this slide)
  - 200k CPU-hrs for background and sky generation
  - Did sky gen (30k CPU-hrs) twice
  - → Need more compute resources under our control than planned – maxed out at SLAC with SVAC, DC2, BT, Handoff
  - Aiming for 350-400 “GLAST” boxes + call on SLAC general queues for noticeable periods
- Berrie ran 10,000 jobs at Lyon for the original background CT runs – a horrible thing to have to do
  - Manually transferred merit files back to SLAC
  - Extend LAT automated pipeline infrastructure to make use of non-SLAC compute farms (may or may not have to transfer files back to SLAC) – Lyon; UW; Padova?; GSFC-LHEA?
  - Speaks to maximizing sims capability
- We juggled priority with SVAC commissioning
- Pipeline 1 handles 2 “streams” well enough
  - More would have been tricky
- Ate up about 3-4 TB of disk to keep all MC, Digi, Recon etc. files → Pipeline 2
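
As a sanity check on the numbers above, a minimal back-of-envelope sketch of the wall-clock time such a campaign takes on a dedicated farm; the 80% node-efficiency factor is an assumption for illustration, not a measured DC2 value.

```python
# Back-of-envelope sizing for the DC2-style simulation campaign.
# CPU-hour and box counts are from the slide; the efficiency factor is assumed.

def wall_clock_days(cpu_hours, n_nodes, efficiency=0.8):
    """Rough wall-clock estimate, assuming each node runs one job at a time
    and is busy `efficiency` of the time (queue gaps, failures, reruns)."""
    return cpu_hours / (n_nodes * efficiency * 24.0)

total_cpu_hours = 200_000          # background + sky generation
sky_gen_rerun   = 30_000           # sky generation had to be redone once

for nodes in (350, 400):
    days = wall_clock_days(total_cpu_hours + sky_gen_rerun, nodes)
    print(f"{nodes} dedicated boxes -> ~{days:.0f} days of wall-clock time")
```

Even with 400 boxes to ourselves this comes out to roughly a month of wall-clock time, which is why the overlap with SVAC, BT and Handoff hurt.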

Making the Sky 2
- People resources
  - Tom Glanzman put his BaBar expertise to work minimizing exposure to SLAC resource bottlenecks
    - Accessing NFS from upwards of 400 CPUs was the biggest problem
    - Use AFS and batch-node local disk as much as possible (see the stage-in sketch after this slide)
    - Made good use of SCS’ Ganglia server/disk monitoring tools
    - Developed pipeline performance plots (as shown at the Kickoff meeting)
  - Tom and I (mostly Tom) ran off the DC2 datasets
    - Some complexity due to secret sky code and configs
    - Some complexity due to last-minute additions of variables calculated outside Gleam
    - Effort front-loaded – setting up tasks
    - Now a fairly small load to monitor/repair during routine running
    - Some cleanup at the end
  - Root4 → Root5 transition disrupted the DataServer
  - Will likely need a “volunteer” for future big LAT simulations
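
A minimal sketch of the kind of job wrapper that keeps shared-filesystem traffic down: stage inputs to node-local scratch once, run there, and copy only the outputs back. The paths, the LSCRATCH variable, and the Gleam command line are hypothetical placeholders, not the actual DC2 task configuration.

```python
#!/usr/bin/env python
# Sketch of a batch-job wrapper that avoids hammering the shared NFS area
# from hundreds of nodes: one stage-in, work on local disk, one stage-out.

import os
import shutil
import subprocess
import tempfile

NFS_INPUT  = "/nfs/glast/dc2/inputs/skymodel.xml"   # hypothetical shared input
NFS_OUTPUT = "/nfs/glast/dc2/outputs"               # hypothetical results area

# Prefer the batch system's node-local scratch area if it is defined.
scratch = os.environ.get("LSCRATCH", tempfile.gettempdir())
workdir = tempfile.mkdtemp(prefix="dc2job_", dir=scratch)

try:
    # Stage in: one NFS read per job instead of many small reads during the run.
    local_input = os.path.join(workdir, os.path.basename(NFS_INPUT))
    shutil.copy(NFS_INPUT, local_input)

    # Run the simulation step entirely against local disk
    # (placeholder command line, not the real Gleam invocation).
    subprocess.run(["Gleam", "--input", local_input, "--output", "merit.root"],
                   cwd=workdir, check=True)

    # Stage out: copy only the merit file back to the shared area.
    shutil.copy(os.path.join(workdir, "merit.root"), NFS_OUTPUT)
finally:
    shutil.rmtree(workdir, ignore_errors=True)
```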

Grab Bag
- Great to have GBM involved!
  - Should at least have an archival copy of the GBM simulation code used
- DC2 Confluence worked
  - Nice organization by Seth on Forum and Analysis pages
  - Easy to use and peruse
  - Will clone for Beamtest
- Great teamwork
  - It was really fun to work with this group
  - The secret sky made it hard to ask many people to help with problems – but that is behind us now
- Histories
  - Pointing and livetime needed manual intervention to fix SAA passages etc. Should track that down.
- Analysis details
  - Might have been nice to have Class A/B in merit (IMHO)
  - GTIs were a pain if you got them wrong. Tools now more tolerant. (See the GTI sketch after this slide.)
  - Livetime cubes were made by hand
  - Diffuse response in FT1 was somewhat cobbled together
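
To make concrete why bad GTIs hurt, a minimal sketch of what a good-time-interval cut does; the intervals and event times below are made-up numbers for illustration only.

```python
# Events outside the GTIs are silently dropped and the summed ontime (hence
# the exposure and any derived flux) changes with them - which is why a
# mismatch between FT1 GTIs and the livetime cube is so easy to get wrong.

def in_gti(t, gtis):
    """True if mission elapsed time t falls inside any (start, stop) interval."""
    return any(start <= t < stop for start, stop in gtis)

gtis = [(1000.0, 1800.0), (2400.0, 3000.0)]     # e.g. gaps around SAA passages
event_times = [1100.0, 1750.0, 2000.0, 2500.0]  # arbitrary example events

kept = [t for t in event_times if in_gti(t, gtis)]
ontime = sum(stop - start for start, stop in gtis)

print(f"kept {len(kept)}/{len(event_times)} events, ontime = {ontime:.0f} s")
```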

GSSC Data Server
- 890 hits total during DC2
- Repopulating the server is manual; 2 months of data takes about 5 hrs (a retry-loop sketch follows this slide)
- Brings up questions:
  - What chunks of data will be retransmitted to GSSC?
  - What are the “failure modes” for data delivery?
  - What will “Event” data look like?
  - How many versions of data are to be kept online in the servers?
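
One way the manual repopulation and the delivery "failure modes" question could be handled together is a chunked ingest loop with retries; everything below (the day-chunk naming and ingest_chunk) is a hypothetical stand-in, not the real GSSC ingest interface.

```python
# Sketch: walk the delivery in fixed-size chunks, ingest each one, and retry
# or record failures instead of stopping the whole repopulation.

import time

def ingest_chunk(chunk_id):
    """Placeholder for the real ingest call (e.g. load one day of FT1/FT2)."""
    print(f"ingesting chunk {chunk_id}")

def repopulate(chunk_ids, max_retries=3, pause=30.0):
    failed = []
    for chunk_id in chunk_ids:
        for attempt in range(1, max_retries + 1):
            try:
                ingest_chunk(chunk_id)
                break
            except Exception as err:          # a delivery "failure mode"
                print(f"chunk {chunk_id} attempt {attempt} failed: {err}")
                time.sleep(pause)
        else:
            failed.append(chunk_id)           # give up, report at the end
    return failed

# e.g. one chunk per day for a two-month delivery
leftover = repopulate([f"day_{n:03d}" for n in range(60)])
print("chunks needing manual attention:", leftover)
```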

LAT DataServer Usage
- ½ of usage from Julie!
- Similar questions posed as for the GSSC server

Lessons
- Statistics don’t include “astro” data server or WIRED event display use.
- Lessons learned
  - Problem: jobs running out of time
    - Need a more accurate way to predict time, or run jobs with no time limit
  - Problem: need clearer notification to the user if a job fails (see the wrapper sketch below)
  - LAT Astro server never got the GTIs right
    - Hence little used, even as a west-coast US mirror
  - Were not able to implement an efficient connection to Root files (the main reason for its existence). Still needs work.
  - Unknown if the limited use of the Event Display is significant.
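
A minimal sketch of a wrapper addressing the time-limit and failure-notification problems: run the job under an explicit wall-clock budget and tell the user about any failure. The extract_events command and the notify() target are placeholders, not the actual data-server job interface.

```python
# Run a job with an explicit time budget and report both timeouts and
# non-zero exits, instead of letting failures disappear into the batch logs.

import subprocess
import sys

def notify(user, message):
    """Placeholder notification - swap in email or a monitoring-page update."""
    print(f"[notify {user}] {message}")

def run_with_budget(cmd, user, budget_s):
    try:
        result = subprocess.run(cmd, timeout=budget_s)
    except subprocess.TimeoutExpired:
        notify(user, f"job exceeded its {budget_s}s budget and was killed: {cmd}")
        return 1
    if result.returncode != 0:
        notify(user, f"job failed with exit code {result.returncode}: {cmd}")
    return result.returncode

if __name__ == "__main__":
    # e.g. a data-server extraction job with a 2-hour budget
    sys.exit(run_with_budget(["extract_events", "--region", "dc2_field"],
                             user="analyst", budget_s=2 * 3600))
```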