Lessons Learned: The Organizers
Kinds of Lessons*
  Operational
    Distributing the code
    Making the sky data
      Required compute resources
      Required people resources
    Remaking the sky data
    Distributing the data: DataServers
  Functional
    Problems extracting livetime history
    Problems extracting pointing history – SAA entry/exit
  Organizational
    How/when to draw on expert help for problem solving
    Sky model
    Confluence/Workbook
  Analysis
    Access to standard cuts
    GTIs
    Livetime cubes, diffuse response
  * Or things to fix for next time
Making the Sky 1
  Code Distribution
    Navid made nice self-installers with wrappers that took care of env vars etc.
    Creation of distributions is semi-manual. Should find out how to automate it (rules based)
  We needed a lot more compute resources than we anticipated
    200k CPU-hrs for background and sky generation
      Did sky gen (30k CPU-hrs) twice
    Need more compute resources under our control than planned – maxed out at SLAC with svac, DC2, BT, Handoff
      Aiming for 350-400 “GLAST” boxes + call on SLAC general queues for noticeable periods
    Berrie ran 10,000 jobs at Lyon for the original background CT runs – a horrible thing to have to do
      Manually transferred merit files back to SLAC
    Extend LAT automated pipeline infrastructure to make use of non-SLAC compute farms (may or may not have to transfer files back to SLAC) – Lyon; UW; Padova?; GSFC-LHEA?
      Speaks to maximizing sims capability
  We juggled priority with SVAC commissioning
    Pipeline 1 handles 2 “streams” well enough
    More would have been tricky
  Ate up about 3-4 TB of disk to keep all MC, Digi, Recon etc. files
  Pipeline 2
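As a rough sanity check on the compute figures above, the sketch below converts a CPU-hour budget into wall-clock days for a given farm size. It assumes one job slot per “GLAST” box and a constant utilization, neither of which is stated on the slide, and the function name is made up for illustration.

```python
def wall_clock_days(cpu_hours, n_slots, utilization=1.0):
    """Estimate wall-clock days needed to burn `cpu_hours` on `n_slots` job slots.

    Assumptions (not from the slides): one job slot per box and a constant
    fraction `utilization` of slots kept busy for the whole campaign.
    """
    if n_slots <= 0:
        raise ValueError("n_slots must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    hours = cpu_hours / (n_slots * utilization)
    return hours / 24.0


if __name__ == "__main__":
    # 200k CPU-hrs for background and sky generation, on 350 to 400 boxes.
    for boxes in (350, 400):
        days = wall_clock_days(200_000, boxes)
        print(f"{boxes} boxes: {days:.1f} days")
```

Under these assumptions the 200k CPU-hr budget corresponds to roughly three weeks of a fully dedicated 350-400 box farm, which is why sharing the farm with other activities mattered.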
Making the Sky 2
  People resources
    Tom Glanzman applied his BABAR expertise to minimize exposure to SLAC resource bottlenecks
      Accessing nfs from upwards of 400 CPUs was the biggest problem
      Use afs and batch node local disk as much as possible
      Made good use of SCS’ Ganglia server/disk monitoring tools
      Developed pipeline performance plots (as shown at the Kickoff meeting)
    Tom and I (mostly Tom) ran off the DC2 datasets
      Some complexity due to secret sky code and configs
      Some complexity due to last-minute additions of variables calculated outside Gleam
      Effort was front-loaded – setting up tasks
      Now a fairly small load to monitor/repair during routine running
      Some cleanup at the end
      Root4 to Root5 transition disrupted the DataServer
    Will likely need a “volunteer” for future big LAT simulations
Grab Bag
  Great to have GBM involved!
    Should at least have an archival copy of the GBM simulation code used
  DC2 Confluence worked
    Nice organization by Seth on Forum and Analysis pages
    Easy to use and peruse
    Will clone for Beamtest
  Great teamwork
    It was really fun to work with this group
    The secret sky made it hard to ask many people to help with problems – but that is behind us now
  Histories
    Pointing and livetime needed manual intervention to fix SAA passages etc. Should track that down.
  Analysis details
    Might have been nice to have Class A/B in merit (IMHO)
    GTIs were a pain if you got them wrong. Tools are now more tolerant.
    Livetime cubes were made by hand
    Diffuse Response in FT1 was somewhat cobbled together
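To illustrate the kind of GTI mistakes that cause trouble (inverted or overlapping intervals), here is a minimal sketch of a validate-and-merge step. It is not how the LAT tools handle GTIs; representing intervals as plain (start, stop) tuples in seconds and the function names are assumptions made for this example.

```python
def merge_gtis(intervals):
    """Validate and merge good time intervals given as (start, stop) pairs.

    Hypothetical helper: rejects empty or inverted intervals, sorts by start
    time, and merges intervals that overlap or touch.
    """
    cleaned = []
    for start, stop in intervals:
        if stop <= start:
            raise ValueError(f"bad GTI: start={start}, stop={stop}")
        cleaned.append((float(start), float(stop)))
    cleaned.sort()

    merged = []
    for start, stop in cleaned:
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], stop))
        else:
            merged.append((start, stop))
    return merged


def total_good_time(intervals):
    """Sum of interval lengths after merging, so overlaps are not double counted."""
    return sum(stop - start for start, stop in merge_gtis(intervals))
```

Summing the merged intervals rather than the raw ones avoids overestimating the good time when intervals overlap, which would otherwise propagate into anything computed from it.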
GSSC Data Server
  890 hits total during DC2
  Repopulating the server is manual; 2 months of data takes about 5 hrs
  Brings up questions:
    What chunks of data will be retransmitted to GSSC?
    What are the “failure modes” for data delivery?
    What will “Event” data look like?
    How many versions of data are to be kept online in servers?
LAT DataServer Usage
  ½ of usage from Julie!
  Raises questions similar to those for the GSSC server
Lessons
  Statistics don’t include “astro” data server or WIRED event display use.
  Lessons Learned
    Problem: Jobs running out of time
      Need more accurate way to predict time, or run jobs with no time limit
    Problem: Need clearer notification to user if job fails
    LAT Astro server never got the GTIs right
      Hence little used, even as west coast US mirror
      Were not able to implement efficient connection to Root files (main reason for its existence). Still needs work.
    Unknown if limited use of Event Display is significant.
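One simple way to approach the "predict time" item above is to set a job's time limit from the runtimes of earlier, similar jobs. The sketch below is only an illustration of that idea; the function name, the quantile, and the safety factor are assumptions, not anything the DataServer actually does.

```python
import math


def suggest_time_limit(past_runtimes, quantile=0.95, safety_factor=1.5):
    """Suggest a time limit from observed runtimes of comparable jobs.

    Takes the runtime at the given quantile (nearest-rank method) and
    multiplies it by a safety factor. Units are whatever the inputs use.
    """
    if not past_runtimes:
        raise ValueError("need at least one past runtime")
    if not 0 < quantile <= 1:
        raise ValueError("quantile must be in (0, 1]")
    ordered = sorted(past_runtimes)
    # Nearest-rank index: smallest value covering `quantile` of the samples.
    index = max(0, math.ceil(quantile * len(ordered)) - 1)
    return ordered[index] * safety_factor
```

A high quantile keeps rare long jobs from being killed, at the cost of a looser limit for typical jobs; the alternative named on the slide, running with no limit at all, avoids the prediction problem entirely.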