IT Monitoring WG
Monitoring Use Cases
16 January 2012
Grid Technology, CERN IT Department, CH-1211 Geneva 23, Switzerland

Introduction
Goal
–Analyze the different types of use cases from all IT groups
–Identify a few representative common use cases
Contributions requested under 3 categories
–Fast & Furious (FF): alarms, end-user views
–Digging Deep (DD): infrequent analysis with lots of data, history analysis
–Correlate & Combine (CC): combining data from different domains
Contributions received from 7 groups
–Input from DB and DSS missing

Fast & Furious
Group  Action             Element
CF     list, join, alarm  nodes, exceptions
CIS    list, alarm        document queue, web servers
CS     list, alarm        router, switch, network
DI     list               connections, urls, external locations
ES     list, find         job status, transfer status, site status
GT     list, find         services, sites
PES    list, join, alarm  users, batch jobs, hardware

Fast & Furious
Groups:
–CF, DI, PES, ??
Role:
–Sys Admin, Service Manager
Tasks:
–Get metric values for hardware and selected services
–Filter metrics by type (role, cluster, etc.)
–Aggregate exceptions and errors
–Raise alarms according to appropriate thresholds
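
The Fast & Furious tasks above (filter metrics, aggregate exceptions, raise alarms against thresholds) can be sketched roughly as follows. This is only an illustration: the `Metric` record, the metric names, and the threshold values are hypothetical and do not come from any group's contribution.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Metric:
    node: str
    cluster: str
    name: str
    value: float


# Hypothetical thresholds; real values would come from each service manager.
THRESHOLDS = {"load": 10.0, "disk_used_pct": 90.0}


def filter_by_cluster(metrics, cluster):
    """Keep only the metrics reported by nodes of one cluster."""
    return [m for m in metrics if m.cluster == cluster]


def raise_alarms(metrics, thresholds=THRESHOLDS):
    """Return the metrics whose value exceeds the threshold for their name."""
    return [
        m for m in metrics
        if m.name in thresholds and m.value > thresholds[m.name]
    ]


def count_alarms_per_node(alarms):
    """Aggregate alarms so that each node appears once with its alarm count."""
    return Counter(a.node for a in alarms)


# Hypothetical sample data.
metrics = [
    Metric("node01", "batch", "load", 12.5),
    Metric("node01", "batch", "disk_used_pct", 95.0),
    Metric("node02", "batch", "load", 3.2),
    Metric("node03", "web", "load", 11.0),
]

batch_alarms = raise_alarms(filter_by_cluster(metrics, "batch"))
alarms_per_node = count_alarms_per_node(batch_alarms)
print(alarms_per_node)
```

Filtering before thresholding keeps the view specific to one role or cluster, which is what an end-user alarm view would presumably need.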

Digging Deep
Group  Action               Element
CF     report, reorder      historical data
CIS    report, statistics   historical data
CS     correlate            traffic, route, configuration
DI     correlate            ip addresses, devices
ES     statistics, reorder  historical data
GT     statistics           service status, service availability
PES    statistics           historical data, CPU usage, disk usage

Digging Deep
Groups:
–CF, CS, ES, PES
Role:
–VO Admin, Service Manager
Tasks:
–Curation of hardware and network historical data
–Analysis and statistics (trends) on batch job data and network data
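
As one possible reading of "analysis and statistics (trends)" on historical data, the sketch below fits a least-squares line to a usage history and extrapolates it. The disk usage samples are invented for illustration, and a linear fit is an assumption; the contributions do not prescribe a method.

```python
def linear_trend(samples):
    """Least-squares slope of (day, value) samples, in value units per day."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    variance = sum((x - mean_x) ** 2 for x, _ in samples)
    return covariance / variance


# Hypothetical history: (day index, disk usage in percent).
disk_history = [(0, 60.0), (1, 62.0), (2, 64.0), (3, 66.0), (4, 68.0)]

slope = linear_trend(disk_history)
# Extrapolate from the last sample to 100 % usage.
days_until_full = (100.0 - disk_history[-1][1]) / slope
print(f"growth: {slope:.1f} %/day, full in about {days_until_full:.0f} days")
```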

Correlate & Combine
Group  Action     Element
CF     correlate  raw data, alarms, cluster, service, app
CIS    correlate  CDS, INSPIRE, usage, AFS
CS     correlate  high traffic, firewall load, hardware problems
DI     correlate  p2p problems, outgoing connections, IPs
ES     correlate  job/transfer rate/metadata, site status, fts, srm
GT     correlate  service status, job status
PES    correlate  alarms, metrics, hardware, users, services

Correlate & Combine
Groups:
–CF, CIS, ES, GT, PES, ??
Role:
–Service Manager
Tasks:
–Correlation between alarms, hardware, and services
–Correlation between usage, hardware, and services
–Correlation between job status and grid status
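
A minimal sketch of the Correlate & Combine idea: records from separate domains (alarms, services, jobs) are joined on a shared key. The host name as join key, the record layouts, and the sample values are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical records from three separate monitoring domains.
alarms = [
    {"host": "node01", "alarm": "disk_failure"},
    {"host": "node03", "alarm": "high_load"},
]
services = [
    {"host": "node01", "service": "batch"},
    {"host": "node02", "service": "batch"},
    {"host": "node03", "service": "web"},
]
jobs = [
    {"host": "node01", "status": "failed"},
    {"host": "node01", "status": "failed"},
    {"host": "node02", "status": "done"},
]


def correlate(alarms, services, jobs):
    """Join the three domains on host name and group alarms per service."""
    service_of = {s["host"]: s["service"] for s in services}

    # Count failed jobs per host.
    failed_jobs = defaultdict(int)
    for job in jobs:
        if job["status"] == "failed":
            failed_jobs[job["host"]] += 1

    # Attach the service and the failed-job count to every alarm.
    report = defaultdict(list)
    for alarm in alarms:
        host = alarm["host"]
        report[service_of.get(host, "unknown")].append(
            {"host": host, "alarm": alarm["alarm"], "failed_jobs": failed_jobs[host]}
        )
    return dict(report)


report = correlate(alarms, services, jobs)
for service, entries in report.items():
    print(service, entries)
```

In this toy data the disk alarm on node01 lines up with two failed batch jobs, which is the kind of cross-domain link a service manager would look for.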