1 Site Manageability Issues for LCG - Ian Bird, IT Department, CERN - HEPiX, JLab, 12th October 2006

2 Introduction
- For LCG we have started to measure site and service availability
- The results are not encouraging:
  - Even for Tier 1 sites, service (and hence site) availability is not good (targets are being missed)
  - And it is not getting better
  - How can we expect Tier 2 sites to be reliable?
- This is not new: it has been a problem for more than a year
- Site instabilities are observed
- Strategies?
  - Improve the reliability of sites
  - Improve the ability of jobs to dynamically find working sites

3 SAM Deployment (from "SAM Status", EGEE'06 Conference, 25-29 September 2006)
SAM contains all the tests from SFT, but in addition:
- a new transport layer and DB
- monitoring of other services
- availability metric calculation
- web service based access to data
Goal: replace SFT by SAM and phase out SFT (possibly by the end of the year)
- "Almost" done for EGEE Production and PPS
- Still some components missing:
  - COD Dashboard integration - essential for operations
  - others?
- Other users that did not migrate to SAM:
  - regional installations (certification) - how many?
  - other projects - which ones?

4 Deployment status (Production)
[Architecture diagram: SAM and SFT submission feed the SFT Server (MySQL DB) and SAM Server (Oracle DB); inputs include GStat, the RB monitor and the SFT/SAM Admin tools; consumers include the COD Dashboard, the SAM and SFT Portals, Metrics (GridView, XSQL), the CIC ROC report and FCR.]
In addition:
- Web service API for other tools (SLS, alarms, custom queries for ATLAS)
- Experimental test results for SRMs from PhEDEx (CMS)

5 Sensors overview
Integrated with the submission framework:
- Test jobs: currently the CE and gCE sensor, which replaces the SFT submission framework
- Remote probing: SE, SRM, LFC, FTS (!)
Standalone sensors - existing monitoring tools publishing to the SAM DB:
- RB job monitor - maintained at RAL, UK
- GStat - site-BDII and top-level BDII, maintained at Sinica, Taiwan

6 Integrated sensors - test jobs
- Jobs are submitted through a selected RB to all CEs and gLite-CEs listed in the GOC DB and/or BDII
- The test job is executed on one worker node (selected by the batch system) at the site
- It performs a number of tests and returns the results to the SAM server
- Delay between submission time and results: about 5 minutes, up to a few hours when the site is under heavy load
- Frequency: currently every 2 hours for the OPS VO

7 Integrated sensors - remote probing
- A set of tests executed on a central UI where the SAM client is installed
- Each test gets the host name of the service node to test, and checks the node remotely by using grid command line tools
- Results are returned immediately (exception: FTS)
- Delayed results are possible - example: the FTS sensor, which submits test transfer jobs
- Frequency: currently once per hour for OPS
(see the sketch below)
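
A minimal sketch of what such a remote probe could look like, driven from a UI as described above. The helper names, the SE host and the publication step are hypothetical; the real SAM client API and the exact lcg-utils flags used by the production sensors are not shown in these slides.

```python
#!/usr/bin/env python
# Sketch of a "remote probing" sensor: run on a central UI, take the host
# name of the service node, check it with grid command-line tools, and
# report a status record. Publication to the SAM DB is not shown.
import subprocess, time

def probe_srm(hostname, vo="ops", timeout=120):
    """Check an SE/SRM endpoint remotely with a grid CLI tool (illustrative)."""
    start = time.time()
    # Listing via lcg-ls; the exact tool and flags in the real test may differ.
    cmd = ["lcg-ls", "-b", "-D", "srmv2", "srm://%s/" % hostname]
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        status = "ok" if proc.returncode == 0 else "error"
        detail = proc.stderr.strip()
    except subprocess.TimeoutExpired:
        status, detail = "error", "probe timed out"
    return {"node": hostname, "test": "SE-lcg-ls", "vo": vo,
            "status": status, "duration": time.time() - start, "detail": detail}

if __name__ == "__main__":
    # A real sensor would loop over all service nodes from the GOC DB / BDII
    # and publish each result to the SAM DB; here we just print one probe.
    print(probe_srm("srm.example.org"))
```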

8 Existing tests - CE, gCE
- job submission - the UI->RB->CE->WN chain
- version of the CA certificates installed (on the WN!)
- version of the middleware software (on the WN!)
- broker info - checking the edg-brokerinfo command
- UNIX shell environment consistency (BASH vs. CSH)
- replica management tests - using lcg-utils, against the default SE defined on the WN and a selected "central" SE (third-party replication)
- accessibility of the experiments' software directory - environment variables, directory existence
- accessibility of the VO tag management tools
- other tests: R-GMA client check, APEL accounting records

9 Existing tests - SE, SRM
- storing a file from the UI - using the lcg-cr command, with LFC registration
- getting the file back to the UI - using the lcg-cp command
- removing the file - using the lcg-del command, with LFC de-registration
(a sketch of this store/get/remove cycle follows below)
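
A minimal sketch of the store/get/remove cycle listed above, driven from a UI with lcg-utils installed. The SE host, VO and LFN are placeholders, and the exact flags used by the real SAM test may differ from what is shown here.

```python
# Sketch of the SE/SRM test cycle: lcg-cr (store + LFC registration),
# lcg-cp (copy back), lcg-del (remove + LFC de-registration).
import subprocess, uuid

def run(cmd):
    """Run a grid CLI command and return (exit_code, combined_output)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, (proc.stdout + proc.stderr).strip()

def se_test_cycle(se_host, vo="ops"):
    lfn = "lfn:/grid/%s/sam-test-%s" % (vo, uuid.uuid4().hex)   # illustrative LFN
    local_in, local_out = "/tmp/sam-testfile", "/tmp/sam-testfile-copy"
    open(local_in, "w").write("SAM SE test payload\n")

    steps = [
        ("store (lcg-cr, registers in LFC)",
         ["lcg-cr", "--vo", vo, "-d", se_host, "-l", lfn, "file:" + local_in]),
        ("get back (lcg-cp)",
         ["lcg-cp", "--vo", vo, lfn, "file:" + local_out]),
        ("remove (lcg-del, de-registers from LFC)",
         ["lcg-del", "--vo", vo, "-a", lfn]),
    ]
    for name, cmd in steps:
        code, out = run(cmd)
        print("%-40s %s" % (name, "OK" if code == 0 else "FAILED: " + out))
        if code != 0:
            return False
    return True

if __name__ == "__main__":
    se_test_cycle("se.example.org")   # placeholder SE host
```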

10 Existing tests - LFC, FTS
LFC:
- directory listing - using the lfc-ls command on /grid
- creating a file entry in the /grid/ area
FTS:
- checking whether the FTS is published correctly in the BDII
- channel listing - using the glite-transfer-channel-list command against the ChannelManagement service
- transfer test (in development): submitting transfer jobs between the SRMs of all Tier-0 and Tier-1 sites (N-to-N testing) and checking the status of the jobs
- Note: this test relies on the availability of the SRMs at the sites
(see the sketch below)
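
A minimal sketch of the LFC and FTS checks above, again driven from a UI with the grid clients installed. The LFC host and FTS endpoint are placeholders, and the real SAM sensors may invoke these tools with different options.

```python
# Sketch: list /grid on an LFC and list channels on an FTS ChannelManagement
# service, reporting OK/FAILED per check.
import os, subprocess

def check(name, cmd, env=None):
    proc = subprocess.run(cmd, capture_output=True, text=True, env=env)
    print("%-30s %s" % (name, "OK" if proc.returncode == 0 else "FAILED"))
    return proc.returncode == 0

if __name__ == "__main__":
    lfc_env = dict(os.environ, LFC_HOST="lfc.example.org")   # hypothetical LFC host
    check("LFC listing of /grid", ["lfc-ls", "/grid"], env=lfc_env)
    # Channel listing against a (hypothetical) FTS ChannelManagement endpoint
    check("FTS channel listing",
          ["glite-transfer-channel-list", "-s",
           "https://fts.example.org:8443/glite-data-transfer-fts/services/ChannelManagement"])
```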

11 Existing standalone sensors
GStat:
- site-BDIIs: accessibility (response time), sanity checks (partial Glue schema validation)
- top-level BDIIs: accessibility (response time), reliability of data (number of entries)
RB:
- jobs submitted through all important RBs to selected "reliable" CEs
- measuring the time of matchmaking

12 Availability Metric - Proposal
1) Metrics for sites:
   - site core services: an aggregation of CE, site-BDII and SE
   - VO-specific services (could include VO tests)
2) Metrics for central services, aggregated by VO
Per VO: 1) AND 2)
An interface to query the DB and get the data to play with

13 Availability metrics - algorithm
Status of node N = ∧ over all critical tests t: TestResult(N, t)
Status of site S = [Status(CE1) ∨ Status(CE2) ∨ ... ∨ Status(CEn)] ∧ [Status(SRM1) ∨ ... ∨ Status(SRMn)] ∧ Status(site BDII)
Status of central service C = ∨ over all nodes N ∈ instances(C): Status(N)
(∧ = boolean AND, ∨ = boolean OR)
Everything is calculated for each VO that has defined critical tests in FCR
Results make sense only if the VO submits tests!
(a sketch of this aggregation follows below)
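
A minimal sketch of the aggregation above: a node is up only if all of its critical tests pass (AND), a service type at a site is up if at least one instance is up (OR), and the site is up if every service type is up (AND). The data layout and names are illustrative, not the real SAM DB schema.

```python
def node_status(test_results):
    """test_results: {test_name: bool} for the critical tests of one node."""
    return all(test_results.values())

def site_status(site):
    """site: {service_type: {node_name: {test: bool}}}, e.g. 'CE', 'SRM', 'sBDII'."""
    per_type = []
    for service_type, nodes in site.items():
        # OR over the instances of one service type
        per_type.append(any(node_status(tests) for tests in nodes.values()))
    # AND over the service types (CE, SRM, site BDII, ...)
    return all(per_type)

def central_service_status(instances):
    """instances: {node_name: {test: bool}} -- OR over redundant instances."""
    return any(node_status(tests) for tests in instances.values())

if __name__ == "__main__":
    example_site = {
        "CE":    {"ce01": {"js": True, "rm": True}, "ce02": {"js": False, "rm": True}},
        "SRM":   {"srm01": {"put": True, "get": True}},
        "sBDII": {"bdii01": {"ping": True}},
    }
    print("site up:", site_status(example_site))   # True: ce01 and srm01 pass
```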

14 Availability metrics - algorithm II
- Service and site status values are recorded every hour (24 snapshots per day)
- Daily, weekly and monthly availability is calculated by integration (averaging) over the given period
- Scheduled downtime information from the GOC DB is also recorded and integrated, for comparison
- Details of the algorithm on the GOC wiki: http://goc.grid.sinica.edu.tw/gocwiki/SAME_Metrics_calculation
(see the sketch below)
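
A minimal sketch of the "integration" step described above: daily availability as the fraction of hourly snapshots in which the service or site was up, with scheduled downtime tracked separately for comparison. The exact treatment in the real algorithm is documented on the GOC wiki page cited above.

```python
def daily_availability(hourly_status):
    """hourly_status: list of 24 booleans, one per hourly snapshot."""
    return sum(hourly_status) / float(len(hourly_status))

def daily_scheduled_downtime(hourly_in_downtime):
    """Fraction of the day covered by scheduled downtime from the GOC DB."""
    return sum(hourly_in_downtime) / float(len(hourly_in_downtime))

if __name__ == "__main__":
    up        = [True] * 18 + [False] * 6    # down for 6 of 24 snapshots
    scheduled = [False] * 20 + [True] * 4    # 4 of those hours were announced downtime
    print("availability:       %.2f" % daily_availability(up))                 # 0.75
    print("scheduled downtime: %.2f" % daily_scheduled_downtime(scheduled))    # 0.17
```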

15 Availability metrics - GridView

16 Availability metrics - data export

17 Ongoing work
VO-specific submission:
- LHCb already submits jobs (not frequently)
- ATLAS - jobs will be submitted from the central SAM UI
Jobwrapper tests - see the JobWrapper tests slides
Alarm system interface:
- DB model, triggers, web service interface
- basic development done in SAM
- development of the interface in the COD dashboard is in progress

18 Open issues
All sensors have to be reviewed and fixed:
- check whether the tests reflect real usage (experiments)
- avoid dependencies on central services where possible
- increase the reliability of results (false positive alarms)
- increase test verbosity (where applicable)
Missing sensors/tests have to be written
All tests should be well documented (TestDef inline doc + wiki)
Alarm masking is still missing (some development done; it should be integrated in the new COD Dashboard) - the other side of the false-positive alarm problem
The availability metric should be reviewed - is the current one the way to go?

20 Analysis of job efficiencies

21 Automatic report generation (efficiency as seen by analysis jobs)
- Publish data every day: http://dboard-gr.cern.ch:8888/dashboard/data/yesterday.html
- Select a few sites which can be improved and inject this into the CIC procedure (a GGUS ticket is generated)
- The procedure seems to be reasonable
- Data from CMS and ATLAS
- Extract jobID, worker node, additional information (error-conditions breakdown)
(see the sketch below)
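
A minimal sketch of the selection step described above: from a daily per-site efficiency report, pick a few sites below a threshold as candidates for the CIC procedure. The report format, threshold and site names are illustrative; the real dashboard data layout is not shown in the slides.

```python
def sites_to_follow_up(report, threshold=0.80, max_sites=5):
    """report: {site_name: (successful_jobs, total_jobs)}."""
    candidates = []
    for site, (ok, total) in report.items():
        if total == 0:
            continue
        eff = ok / float(total)
        if eff < threshold:
            candidates.append((eff, site))
    # worst sites first, limited to a handful per day
    return [site for eff, site in sorted(candidates)[:max_sites]]

if __name__ == "__main__":
    yesterday = {"SiteA": (950, 1000), "SiteB": (400, 800), "SiteC": (70, 100)}
    print(sites_to_follow_up(yesterday))   # ['SiteB', 'SiteC']
```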

22 Job attempt errors
- Already analyzed more than 400K jobs (~540K job attempts)
- More details than just the 'last error message'
- Find errors that were recovered thanks to resubmission
Breakdown of job attempt errors:
- Cannot read JobWrapper output, both from Condor and from Maradona: 51k
- Job RetryCount: 27k
- Got a job held event, reason: Unspecified gridmanager error: 21k
- Globus error 131: the user proxy expired (job is still running): 20k
- Job got an error while in the CondorG queue: 20k
- No compatible resources: 15k
- Cannot download ... from ...: 6k
- Error 7: authentication with the remote server failed: 3k
- Error 22: the job manager failed to create an internal script argument file: 3k
- Other errors: 26k

23 CMS SC4 jobs: a few misconfigured worker nodes were spotted and fixed -> dramatic improvement of the site behaviour

24 Example: 20th of June - JobRobot (CMS SC4)
- 6 sites showing problems
- Top "good" sites ("grid" efficiency): MIT = 99.6%, DESY = 100%, Bari = 100%, Pisa = 100%, FNAL = 100%, ULB-VUB = 96.8%, KBFI = 100%, CNAF = 99.6%, ITEP = 100%

25 CMS SC4
Day       Top "good" sites        Comments         Overall efficiency *
June 17   FNAL, DESY, Legnaro                      88%
June 19   UNL, Pisa, Bari         bad sites        72%
June 20   MIT, Bari, Pisa         bad sites        64%
June 21   Pasadena, Bari, Pisa    match problems   69%
June 22   Pasadena, Bari, FNAL    match problems   73%
June 23   MIT, Pasadena, PIC      bad site         77%
* all sites included ("good" and "bad" ones)

26 Snapshots of CMS analysis
Date        Top "good" site grid efficiency   Overall grid efficiency   Comments
April 24    95%-100%                          50%                       Match problems
May 24      84%-99%                           60%                       1 bad site
June 21     99%-100%                          87%
Last days   ~100%                                                       Lots of jobs pending
April 24    No entries
May 24      95%-100%                          25%-50%                   Big site problems
June 23     84%-99%                           91%                       Very stable
June 24     99%-100%                                                    Lots of jobs running

27 Site efficiency (jobs sent to a site) as seen with CMS jobs, for 3 Tier-1 sites: CERN, FNAL, IN2P3 (available for all sites)

28 Total

29 JobWrapper tests I
Motivation:
- It is clear that SFTs do not find all site-related problems
- SAM/SFT jobs do not reach all WNs, and thus do not detect broken WNs, which can lead to an unacceptable failure rate
- A more direct measure of site quality is needed
- A simplified set of tests executed on the WNs by the CE job wrapper with each grid job
- Will potentially reach all WNs in a relatively short time
- Test results will be passed to the job but also published to the central DB
- Core scripts have to be installed on the CE and WNs (will become part of the release)
- The set of tests will be updated centrally (a signed tarball installed in the software area)
(a sketch of such a wrapper hook follows below)
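
A minimal sketch of the idea described above: a small set of quick checks run on the worker node alongside every grid job, with the results both written into the job's working area and staged for publication to a central DB. All function, file and environment-variable names here are hypothetical; the real core scripts installed on the CE and WNs are not shown in these slides.

```python
import json, os, shutil, socket, time

def run_wn_checks():
    """Cheap checks that identify the WN and spot obvious breakage."""
    return {
        "hostname":        socket.gethostname(),           # unique WN identification
        "timestamp":       time.time(),
        "tmp_writable":    os.access("/tmp", os.W_OK),
        "tmp_free_gb":     shutil.disk_usage("/tmp").free / 1e9,
        "sw_area_visible": os.path.isdir(os.environ.get("VO_SW_DIR", "/opt/exp_soft")),
    }

if __name__ == "__main__":
    results = run_wn_checks()
    # 1) pass the results to the job via its working directory ...
    with open("wn-check-results.json", "w") as f:
        json.dump(results, f, indent=2)
    # 2) ... and stage them for later publication to the central DB
    #    (the transport layer used by SAM is not described in these slides).
    print(json.dumps(results))
```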

30 JobWrapper tests II
Gains for operations:
- infrastructure description with unique identification of WNs: the relation between CEs, batch queues and WNs
- detection of monitoring queues pointing to "carefully" selected WNs
- counting WNs at a site without the risk of double counting due to shared batch farms
- detection of sites with broken WNs: basic fabric monitoring for small sites
Status:
- scripts, tests and transport layer ready and certified on the CTB
- deployment on the PPS should start soon
- visualisation and data-processing tools have to be written (team in Wuppertal?)

31 Observations
- Grid middleware (services) is not the most reliable software
  - from the ground up
  - missing management interfaces, consistent logging, etc.
- Sites are not stable
  - a good site one day is a bad site the next
- The Site Functional Tests (SFT) have helped - this is how we know sites are not good
- But:
  - SFT does not show all problems - experiments see failures even when the SFTs are OK
  - SFT is necessary but not sufficient
  - hence job-level SFT
  - many observed problems are not grid-related at all - site failures, batch systems, etc.
- Many new, inexperienced sites are now running computer centres - and they are becoming critical for overall availability

32 What is needed?
Fabric Management:
- batch systems, file systems, etc.
- security (IDS, etc.)
- training
- cookbook / best practices
Fabric Monitoring:
- local monitors
- remote monitors (of Tier 2s?) - GridIce, IN2P3
- share sensors, share tools
- job wrapper monitoring
Grid Service Management:
- standard service interface? (processes, status, start/stop)
- standard framework?
- missing SAM tools?
- relate to local issues (e.g. is FTS pushing to a site?)
Grid Service Monitoring:
- SFT/SAM for a site
- job wrapper monitors
- standard dashboards (for site, for VO, for services)
- basic services monitoring
- accounting (CPU + storage)
- metrics
Tools: GridView, GridIce, SAM/SFT, L&B stats, RTM, R-GMA, MonaLisa, Lemon/SLS, Nagios, Ganglia, RSS, ...

33 Can HEPiX help?
Site management:
- cookbook?
- best practices & useful procedures
- site security tools
- sharing tools
- training?
Fabric monitoring:
- how can HEPiX help grid sites implement reliable fabric monitoring?
- what are the essentials?
- what tools are available?
- how can a site become more reliable?
Workshops at the next HEPiX? Or independently? - drawing on expertise from HEPiX sites
Top 5 issues:
- from operations etc.?
- ask labs to explain how they address them

