Published by Bonnie Gallagher. Modified over 9 years ago.
1
DDM-Panda Issues
Kaushik De
University of Texas at Arlington
DDM Workshop, BNL
September 29, 2006
2
First – a Reminder
- Panda was designed to depend minimally on the robustness of external middleware
- This does not apply to DDM – Panda fully depends on and takes advantage of all DQ2 capabilities
- Panda was the first ATLAS executor to use DQ2 for production – 6 months before the LCG (still not fully done)
- As you saw from Torre’s talk – Panda subscribes to thousands of datasets weekly, and the BNL catalog holds more than a million records – leader in DQ2 deployment and use
- Panda-DDM is often used as an example of success in ATLAS – keep this up!
3
Some Open Issues
- Deployment – need to do better (we have 19 installations)
  - You have already heard from many speakers – no more to say
- Data and catalog consistency and cleanup – next few slides
- Use output file callback in Panda
- Performance and monitoring issues
- Alexei reminded us – also need a script to clean up obsolete datasets
4
Data Transfer Robustness
- Need more robustness in DQ2 to recover from failures
  - Examples in Alexei’s talk (but Panda usually had fewer problems)
  - Important Panda issue: DQ2 should never give up on subscriptions
  - But don’t kill site services because of retries – tricky balancing act!
  - Force (email) human intervention if it is impossible to transfer a file
  - … this is the normal hardening process – will continue
- In the meantime, production must continue
  - Need to increase production rate by a factor of 10 by summer 2007
  - In addition, there will always be some unavoidable error conditions
  - We also need to do site cleanup of SE (cache turnover)
  - Also, delete old temporary Panda datasets (safely): cron script
- So – some post-DQ2 cleanup will always be necessary
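The balancing act on this slide (never abandon a subscription, do not overload site services with retries, email a human when a transfer looks impossible) could be sketched roughly as follows. This is only an illustration under stated assumptions, not DQ2 code: the `Subscription` class, the functions `next_delay` and `record_failure`, the backoff constants and the alert threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple

BASE_DELAY_S = 60        # assumed delay before the first retry
MAX_DELAY_S = 6 * 3600   # assumed cap, so retries never flood site services
ALERT_AFTER = 5          # assumed failure count that triggers a human alert


@dataclass
class Subscription:
    """Hypothetical record of one dataset subscription."""
    dataset: str
    failures: int = 0
    alerted: bool = False


def next_delay(failures: int) -> int:
    """Capped exponential backoff: 60 s, 120 s, 240 s, ... up to MAX_DELAY_S."""
    return min(BASE_DELAY_S * 2 ** max(failures - 1, 0), MAX_DELAY_S)


def record_failure(sub: Subscription) -> Tuple[int, bool]:
    """Register one failed transfer attempt.

    Returns (seconds to wait before the next retry, whether to alert a human now).
    The subscription is never dropped; only the wait between retries grows.
    """
    sub.failures += 1
    alert_now = sub.failures >= ALERT_AFTER and not sub.alerted
    if alert_now:
        sub.alerted = True  # alert once, keep retrying afterwards
    return next_delay(sub.failures), alert_now
```

The cap on the delay is what keeps retries from "killing" a site, while the absence of any give-up branch reflects the requirement that subscriptions are never abandoned.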
5
Proposal for DDM cleanup
- Check and repair consistency of site catalogs
- Script 1: re-register in local LRC all files found on local SE that are registered in DQ2 central catalog, but not at BNL T1
  - Marco is working on this script, based on scripts written by Patrick and Wensheng – need to run as cron at every site when stable
- Script 2: move old missing files to BNL periodically
  - Cron run by Wensheng – need to define ‘old’
- Script 3: safely clean up SE space when getting full
  - Wensheng’s script works well – sites should take over running it
- Keep log of all post-DQ2 repairs – feed back to developers so that DQ2 can improve based on real experience (feed back into monitoring?)
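The selection logic behind Script 1 and Script 3 can be illustrated with plain set and sorting operations. The sketch below is not the actual scripts by Marco, Patrick or Wensheng; it assumes one reading of the slide (Script 1 targets files on the local SE that the central catalog knows about and that are not at BNL T1), and every function name, parameter and threshold is hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple


def files_to_reregister(on_local_se: Set[str],
                        in_central_catalog: Set[str],
                        at_bnl_t1: Set[str],
                        in_local_lrc: Set[str]) -> Set[str]:
    """Script 1 idea: files on the SE, known centrally, not at BNL T1,
    and not yet present in the local LRC."""
    candidates = (on_local_se & in_central_catalog) - at_bnl_t1
    return candidates - in_local_lrc


def files_to_delete(files: Dict[str, Tuple[int, float]],
                    capacity_bytes: int,
                    protected: Optional[Set[str]] = None,
                    high_water: float = 0.9,
                    low_water: float = 0.75) -> List[str]:
    """Script 3 idea: when usage exceeds high_water, pick the least recently
    accessed unprotected files until usage falls to low_water.

    files maps file name -> (size in bytes, last access timestamp).
    protected is an assumed stand-in for 'safely': files that must be kept.
    """
    protected = protected or set()
    used = sum(size for size, _ in files.values())
    if used <= high_water * capacity_bytes:
        return []
    victims = []
    # Oldest access time first (simple cache turnover).
    for name, (size, _) in sorted(files.items(), key=lambda kv: kv[1][1]):
        if used <= low_water * capacity_bytes:
            break
        if name in protected:
            continue
        victims.append(name)
        used -= size
    return victims
```

Both functions only compute what should be acted on; keeping the decision separate from the actual registration or deletion makes it easy to log every repair, as the last bullet of the slide asks.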
6
Site Responsibilities
- Sites, sites, sites!
- Important difference between OSG sites and LCG sites – site managers have always been proactive within U.S. DDM
  - Probably reflected in T1/T2 test results!
- Need to keep this up – sites should check DDM monitor daily
  - http://panda.atlascomp.org/?dash=prod&redirect=pandamon
- Site is responsible for maintaining the local storage element and keeping various services up and running
- Sites should protect data in storage elements
- Some of our recent DDM problems have been site specific – need DDM operations to help (and often fix mistakes)
7
Output Callbacks
- Converging on a solution – latest proposal by Torre
  - Add new Panda job state - ‘transferring’
  - Enable callback for output subscription blocks
  - Panda will change ‘transferring’ -> ‘finished’ when callback received
- Pros:
  - Better tracking of output file transfers through Panda
  - Production team can identify and report problems
- Cons:
  - Jobs may remain un-finished, even though file is available at T2 (physicist can get file through DQ2 – but job not in finished state)
  - Panda queue may grow very large
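The proposed state change can be pictured as a small state machine. The sketch below is a hedged illustration, not Panda code: only the state names ‘transferring’ and ‘finished’ come from the slide, while the `Job` class, the `running` state, and the method names `payload_done` and `on_callback` are assumptions made for the example.

```python
from enum import Enum
from typing import Iterable


class JobState(Enum):
    RUNNING = "running"            # assumed earlier state, not from the slide
    TRANSFERRING = "transferring"  # new state proposed on the slide
    FINISHED = "finished"


class Job:
    """Hypothetical job record tracking outstanding output blocks."""

    def __init__(self, job_id: int, output_blocks: Iterable[str]):
        self.job_id = job_id
        self.pending_blocks = set(output_blocks)
        self.state = JobState.RUNNING

    def payload_done(self) -> None:
        """Payload has ended; outputs may still need to reach their destination."""
        if self.pending_blocks:
            self.state = JobState.TRANSFERRING
        else:
            self.state = JobState.FINISHED

    def on_callback(self, block: str) -> None:
        """Called when a callback reports that an output block has arrived."""
        self.pending_blocks.discard(block)
        if self.state is JobState.TRANSFERRING and not self.pending_blocks:
            self.state = JobState.FINISHED
```

A job whose callback never arrives stays in `TRANSFERRING` indefinitely, which is exactly the "Cons" case listed on the slide: the file may already be usable while the job still counts as unfinished.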
8
Live Examples
- http://panda.atlascomp.org/?dash=prod&reload=yes