Published by Nathaniel Steven Butler
1
USCMS T2 Site Admin Toolkit
Samir Cury
MTF Meeting, May 26th, 2011
2
How it began
- OSG All Hands Meeting 2010, Fermilab
- Yearly T2 workshop: a gathering of site admins
- A lot of ideas and comments
- Some code, mostly scripts
3
About site admins
- The front line of site management
- On a daily basis they deal with:
  - Many requests
  - Many issues
  - Many workarounds: what happens to these?
- Relevant feedback for CMS:
  - Lack of features in existing software
  - Lack of monitoring in existing systems, which may lead to operating blindly
- Is there always someone to listen? Thanks, Monitoring Task Force!
4
Workarounds
- As the previous slide says, this toolkit is all about them.
- Complaining is not always the best way:
  - The request may never be implemented
  - Not everyone will see the same benefit/cost balance
  - Needs differ from site to site
  - Developers do not always think of every user and operations need
- Scripts are written to cover these needs
- These scripts can bring a different approach to operations:
  - Monitoring tools focused on admins' needs
  - They can improve response time and error/waste detection
    - Examples: GridFTP Spy; JobView / CPU efficiency at T1s
  - Not essential, but they usually save some time.
5
The goal
- What is really missing:
  - An official place for unofficial code
  - Encouragement for people to share
- Call for tools:
  - Take the generic ones and package them as RPMs
  - Take the specific ones, make them generic, then package them as RPMs
- A standard place (repository)
- A standard deployment procedure: if it's not quick, no one tries it, hence RPMs
- Helping us to help ourselves.
6
What it is
- Full documentation/reference available at https://twiki.cern.ch/twiki/bin/view/CMS/SIteAdminToolkit
  - There we document each tool included in the toolkit, future plans, etc.
- A collection of scripts that may need some work to get running
  - We try to avoid that by providing RPMs with all dependencies included, either as packages or in the repositories
- A free-time task for everyone involved
  - We normally don't have schedules, but we do have a plan
- Shameless “coders”: that's what we need!
  - We don't care how “badly written” the code is, as long as it works
7
What it certainly is not
- Something maintained by a lot of people
  - There are some people who contribute tools
  - And one dependency solver / packager (me), who would appreciate some help
- Something that will solve all problems
  - That is not the goal; the goal is to bring specific tools together
- Something of “professional quality”
  - The people involved are very capable, but proportionally time-constrained
8
What we can learn
- “Sites” can also produce useful code
  - They will probably write it for themselves, so don't expect:
    - High-quality code
    - Few dependencies
  - Expect:
    - Tools that you can adapt to your site with little effort
    - To contribute and make them better instead of complaining
- “Sites” should be shameless enough to publish (and send us) the tools they find useful
- Ken Bloom gave me a slot at a USCMS T2 support meeting to present the proposal; afterwards, some tools showed up. (Thanks, Ken!)
- T2 coordinators could tell us when they see something useful in their support meetings, and also remind sites that the toolkit is there
9
What I learned
- Going from a script to an RPM takes more work than I thought: many details, dependencies, etc.
- We will be better off with an intermediate step: https://github.com/samircury/US-CMS-T2-Admin-Toolkit
  - People can download and edit the tools there; it is a shortcut for those who really want to spend some time understanding and deploying the tools that don't have an RPM yet
  - It helped me patch Stale Data, improving its command-line interface (CLI)
10
Tools we have right now
- CondorView (Caltech): RPM ready
- GridFTP Spy (Caltech): RPM ready
- Condor4Web (UERJ): RPM ready
- Stale Data (Nebraska): tested, needs packaging
- Condor Extract Mail (Nebraska): to be tested
- dCache tools (Wisconsin): to be tested
- Your tool here
11
CondorView
- GUI for managing Condor
- Lists every single job
- Can list ALL ClassAds for a given job
- Can do what you see in the menu
- When run from the cluster frontend: can SSH to the node, directly into the running job's temporary directory
- When run from the site's CE: can kill/release/restart jobs
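A minimal Python sketch of two of the CondorView features above: listing every job with all of its ClassAds, and turning a menu action into a command. It is not CondorView's code. It assumes the `condor_q -long` text format (one `Attribute = Value` line per attribute, a blank line between jobs) and maps actions onto the standard `condor_rm`, `condor_hold` and `condor_release` tools. The names `ACTIONS`, `parse_long_output`, `job_id`, `action_command` and `run_condor_q` are hypothetical, and "restart" is left out of this sketch.

```python
import subprocess
from typing import Dict, List

# Hypothetical menu actions mapped to standard HTCondor command-line tools.
# "restart" is left out of this sketch.
ACTIONS = {"kill": "condor_rm", "hold": "condor_hold", "release": "condor_release"}


def parse_long_output(text: str) -> List[Dict[str, str]]:
    """Split `condor_q -long` output into one attribute dict per job.

    Assumes `Attribute = Value` lines with a blank line after each job;
    surrounding double quotes are stripped from string values.
    """
    jobs: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line closes the current job, if any.
            if current:
                jobs.append(current)
                current = {}
            continue
        if " = " not in line:
            continue  # skip non-attribute lines, e.g. "-- Schedd: ..." headers
        name, value = line.split(" = ", 1)
        current[name] = value.strip('"')
    if current:
        jobs.append(current)  # the last job may not be followed by a blank line
    return jobs


def job_id(ad: Dict[str, str]) -> str:
    """Return the cluster.proc identifier used by the condor_* tools."""
    return f"{ad['ClusterId']}.{ad['ProcId']}"


def action_command(action: str, ad: Dict[str, str]) -> List[str]:
    """Build the argument list for a menu action, e.g. ['condor_rm', '42.0']."""
    return [ACTIONS[action], job_id(ad)]


def run_condor_q() -> List[Dict[str, str]]:
    """Query the local schedd (requires HTCondor on this host)."""
    result = subprocess.run(["condor_q", "-long"], capture_output=True, text=True, check=True)
    return parse_long_output(result.stdout)
```

A GUI would call `run_condor_q()` to fill its job list and pass the result of `action_command(...)` to `subprocess.run` when the admin picks a menu entry; keeping the parser separate from the subprocess call makes it testable without a running schedd.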
12
GridFTP Spy
- Shows active GridFTP transfers in near real time
- Very useful for optimizing link usage and server settings
- Somewhat tricky to deploy: it needs a shared filesystem for harvesting logs
- How it works: it reads the logs in real time and gathers the interesting information
- I have never tested it myself; testers are welcome!
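The "read the logs in real time" approach can be sketched in Python as below. This is not GridFTP Spy's implementation. It assumes a transfer log with one whitespace-separated `KEY=value` record per finished transfer, carrying `NBYTES`, `DEST` and `TYPE` fields (similar to what globus-gridftp-server writes with transfer logging enabled); check your own server's log format before relying on it. All function names are hypothetical. Because it only sees finished-transfer records, it shows completed traffic per peer rather than transfers still in flight.

```python
import time
from collections import defaultdict
from typing import Dict, Iterable, Iterator, TextIO


def parse_record(line: str) -> Dict[str, str]:
    """Turn 'KEY=value KEY2=value2 ...' into a dict (assumed log format)."""
    fields: Dict[str, str] = {}
    for token in line.split():
        if "=" in token:
            key, value = token.split("=", 1)
            fields[key] = value
    return fields


def follow(handle: TextIO, poll_seconds: float = 1.0) -> Iterator[str]:
    """Yield complete lines as they are appended to a log file, like `tail -f`."""
    handle.seek(0, 2)  # start at the current end of the file
    pending = ""
    while True:
        chunk = handle.readline()
        if not chunk:
            time.sleep(poll_seconds)  # nothing new yet; wait and retry
            continue
        pending += chunk
        if pending.endswith("\n"):  # only emit lines the writer has finished
            yield pending
            pending = ""


def add_record(totals: Dict[str, int], line: str) -> None:
    """Add one transfer's byte count to the running total for its peer and direction."""
    rec = parse_record(line)
    if "NBYTES" not in rec or "DEST" not in rec:
        return  # not a transfer record
    key = f"{rec['DEST'].strip('[]')} {rec.get('TYPE', '?')}"
    totals[key] = totals.get(key, 0) + int(rec["NBYTES"])


def bytes_per_peer(lines: Iterable[str]) -> Dict[str, int]:
    """Sum transferred bytes per remote endpoint (DEST) and direction (TYPE)."""
    totals: Dict[str, int] = defaultdict(int)
    for line in lines:
        add_record(totals, line)
    return dict(totals)
```

For a live view, one could call `add_record` on each line yielded by `follow(open(log_path))` (where `log_path` points at the harvested log on the shared filesystem) and redraw the totals every few seconds; `bytes_per_peer` does the same over a finite list of lines, which is handy for testing.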
13
Condor4Web
- Real-time batch system monitoring
- Visible from anywhere in the world
- Your users like it: they know what's going on with their jobs once they are past the CE
- MC people like it for the same reason
- Live demos: http://monitor.hepgrid.uerj.br/condor/ and http://www.cmsaf.mit.edu/condor4web/
- If you don't use Condor, try JobView: https://twiki.cern.ch/twiki/bin/viewauth/CMS/AnalysisOpsT2Monitoring
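As a rough illustration of the kind of page Condor4Web serves, the Python sketch below counts jobs per owner and state and renders them as an HTML table. It is not Condor4Web's code. It assumes each job is a dict of ClassAd attributes containing `Owner` and `JobStatus` (for example, parsed from `condor_q -long`), and uses HTCondor's JobStatus codes (1 Idle, 2 Running, 3 Removed, 4 Completed, 5 Held, 6 Transferring Output, 7 Suspended). The function names are hypothetical.

```python
import html
from collections import Counter, defaultdict
from typing import Dict, Iterable, Sequence

# HTCondor JobStatus codes.
STATUS_NAMES = {1: "Idle", 2: "Running", 3: "Removed", 4: "Completed",
                5: "Held", 6: "TransferringOutput", 7: "Suspended"}


def summarize(jobs: Iterable[Dict[str, str]]) -> Dict[str, Counter]:
    """Count jobs per owner and per state name."""
    per_owner: Dict[str, Counter] = defaultdict(Counter)
    for ad in jobs:
        state = STATUS_NAMES.get(int(ad["JobStatus"]), "Unknown")
        per_owner[ad["Owner"]][state] += 1
    return dict(per_owner)


def render_html(summary: Dict[str, Counter],
                columns: Sequence[str] = ("Running", "Idle", "Held")) -> str:
    """Render the per-owner counts as a simple HTML table, one row per owner."""
    header = "<tr><th>Owner</th>" + "".join(f"<th>{c}</th>" for c in columns) + "</tr>"
    rows = [header]
    for owner in sorted(summary):
        # Counter returns 0 for states the owner has no jobs in.
        cells = "".join(f"<td>{summary[owner][c]}</td>" for c in columns)
        rows.append(f"<tr><td>{html.escape(owner)}</td>{cells}</tr>")
    return "<table>\n" + "\n".join(rows) + "\n</table>"
```

Regenerating this table every minute or so and serving it as a static page would give users the "what's going on with my jobs" view without shell access to the cluster.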
14
Stale Data
- Similar to the (un)popularity data service
- Shows the datasets against which nobody ran a single job
- Tested and works fine; it has many dependencies, which should be included in the RPM
Example output (per-group figures are in TB and add up to the total):
  date = 15-12-2010, Starting Date = 01-12-2010
  Getting json http://dashb-datasets.cern.ch/dashboard/request.py/inputCollectionsTable_JSON?collec_name=&sites=T2_BR_UERJ&date1=01-12-2010&date2=15-12-2010
  Datasets idle since 01-12-2010
  /JetMET/Run2010A-Dec4ReReco_v1/AOD, 2474.004614433 GB, Owned by AnalysisOps
  /G2Jets_Pt-20to60_TuneZ2_7TeV-alpgen/Fall10-START38_V12-v1/AODSIM, 190.267690679 GB, Owned by top
  /W2Jets_ptW-0to100_TuneZ2_7TeV-alpgen-tauola/Fall10-START38_V12-v1/GEN, 0.686380407 GB, Owned by DataOps
  /QCD6Jets_Pt120to280-alpgen/Spring10-START3X_V26_S09-v1/GEN-SIM-RECO, 42.528487201 GB, Owned by top
  /W1Jets_ptW-800to1600_TuneD6T_7TeV-alpgen-tauola/Fall10-START38_V12-v1/AODSIM, 11.951159415 GB, Owned by top
  (Suppressed)
  Space taken by stale datasets = 408.164419749117 TB
  Broken down by group:
  tracker-dpg => 9.250565041201
  top => 40.841314603557
  AnalysisOps => 157.50586599848
  undef => 15.736526476068
  FacOps => 1.899973228744
  b-tagging => 18.694190177731
  local => 164.130428192715
  DataOps => 0.105556030621
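The Stale Data logic can be sketched in Python as follows. The query parameters (`collec_name`, `sites`, `date1`, `date2`, dates as DD-MM-YYYY) are copied from the output above; that dashboard endpoint dates from 2010 and may no longer be served. The structure of its JSON reply is not shown in the slide, so the sketch assumes the hosted datasets and the set of datasets accessed in the time window have already been extracted into plain Python values. It also assumes decimal units (1 TB = 1000 GB), which the original tool may not use. `dashboard_url` and `stale_report` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple
from urllib.parse import urlencode

# Endpoint as printed in the slide's example output.
DASHBOARD = "http://dashb-datasets.cern.ch/dashboard/request.py/inputCollectionsTable_JSON"


def dashboard_url(site: str, start: str, end: str) -> str:
    """Build the dataset-usage query for one site and date window (DD-MM-YYYY)."""
    params = {"collec_name": "", "sites": site, "date1": start, "date2": end}
    return DASHBOARD + "?" + urlencode(params)


def stale_report(hosted: Iterable[Tuple[str, float, str]],
                 accessed: Set[str]) -> Tuple[List[Tuple[str, float, str]], float, Dict[str, float]]:
    """Find datasets with no jobs in the window and total their size per group.

    hosted:   (dataset name, size in GB, owning group) for each dataset at the site
    accessed: names of datasets that had at least one job in the window
    Returns (stale datasets, total TB, TB per group), assuming 1 TB = 1000 GB.
    """
    stale = [(name, gb, group) for name, gb, group in hosted if name not in accessed]
    by_group_tb: Dict[str, float] = defaultdict(float)
    for _, gb, group in stale:
        by_group_tb[group] += gb / 1000.0
    total_tb = sum(by_group_tb.values())
    return stale, total_tb, dict(by_group_tb)
```

With the site and dates from the example, `dashboard_url("T2_BR_UERJ", "01-12-2010", "15-12-2010")` rebuilds the same query string shown in the output; keeping the aggregation apart from the HTTP fetch makes it easy to test offline.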
15
“Condor Extract Mail”
- Fetches, from the grid proxies on your CEs, the e-mail addresses of the users running jobs in your cluster
Example:
  [root@red ~]# ~bbockelm/extract_email "Bockelman"
  bbockelm@cse.unl.edu
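The slide does not say how the script reads the proxies. One plausible route, assumed in the Python sketch below, is HTCondor's job ClassAd attributes `x509userproxysubject` and `x509UserProxyEmail`, which HTCondor fills in from each job's proxy. The `condor_q -af:t` invocation (tab-separated autoformat output) is an assumption about your HTCondor version: older releases would need `-format` instead, and newer ones may need `-allusers` to see every user's jobs. `CONDOR_Q`, `emails_matching` and `main` are hypothetical names; this is not the Nebraska script.

```python
import subprocess
from typing import Iterable, List

# Assumed query: one line per job, proxy subject and e-mail separated by a tab.
CONDOR_Q = ["condor_q", "-af:t", "x509userproxysubject", "x509UserProxyEmail"]


def emails_matching(name: str, lines: Iterable[str]) -> List[str]:
    """Return unique e-mails whose proxy subject contains `name` (case-insensitive)."""
    needle = name.lower()
    found: List[str] = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            continue  # skip malformed lines
        subject, email = parts
        if email == "undefined" or needle not in subject.lower():
            continue  # no e-mail in the proxy, or a different user
        if email not in found:
            found.append(email)  # many jobs can share one proxy owner
    return found


def main(argv: List[str]) -> None:
    """Print the e-mail(s) for a name, e.g. main(["Bockelman"]) (requires HTCondor)."""
    out = subprocess.run(CONDOR_Q, capture_output=True, text=True, check=True).stdout
    for email in emails_matching(argv[0], out.splitlines()):
        print(email)
```

Wrapped in a small executable that calls `main(sys.argv[1:])`, this would reproduce the usage shown above: pass a name, get back the matching e-mail addresses, one per line.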
16
What CMS can gain
- More than the code, the ideas
- Usability: you may find here potential features for existing official software
- Adapt ideas or tools that deserve it into CMS central monitoring, such as cmsweb
- An overview of site admins' needs and of what they would like to see in the software they use
- Some become patches, like Brian Bockelman's script
- The free software community model is a good example to follow: small patches from many people turn small things into great ones. Share!
17
Thanks to all involved
- Ken Bloom, Michael Thomas: initial effort to set up and make everything public
- Authors who submitted tools:
  - Caltech, Michael Thomas: CondorView, GridFTP Spy
  - Nebraska, Carl Lundsted and Brian Bockelman: Condor Extract Mail, Stale Data
  - Wisconsin, Will: dCache tools
  - UERJ, Samir: Condor4Web
18
Feel free to send:
- Tools
- Suggestions
- Help
But first, we recommend a (short) read here: https://twiki.cern.ch/twiki/bin/view/CMS/SIteAdminToolkit
19
For the future
- 2 trainees at UERJ interested in helping with packaging
- Migrate the YUM repositories to CERN web servers
- Finish testing and packaging the tools we already have
20
Contacts
Samir.cury.siqueira@cern.ch
jafonso@cern.ch
21
Recommended toolkit
http://datagrid.ucsd.edu/toolkit/
22
Thanks!