EGEE-III INFSO-RI
Enabling Grids for E-sciencE
EGEE and gLite are registered trademarks

Mining Job Monitoring Data
Automatic Error Source Detection of Grid Job Failures using Data Mining Techniques

Gerhild Maier
September 24th, 2008

Problem Description

We have ...
– a lot of information about jobs in the Dashboard database
– exit codes
– many tools to monitor jobs

We don't have ...
– a clear classification of all exit codes; application exit codes are sometimes misleading

We want ...
– to look at the underlying problem
– automatic detection of the error source, i.e. the problematic Grid component
– a generic tool for all big LHC experiments
– a simple tool that needs little specification from the user

Approach

Step 1: data preprocessing
– How much job information?
– How many data sets?

Step 2: data mining
– Supervised or unsupervised method?
– Clustering? Classification? Decision tree? Association rules?

Step 3: output representation
– Where to present the output?
– Textual or graphical representation?

Step 1: data preprocessing

Consider six job characteristics:
– username
– site
– computing element
– storage element
– filename
– exit code

Good/bad classification with Support Vector Machines
Select job information over a two-day period
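
A minimal sketch of how one job record with the six characteristics could be turned into a set of attribute=value items, the form that association rule mining operates on. The field names, host names and values below are hypothetical; the actual Dashboard schema is not shown in these slides.

```python
# Hypothetical field names for the six job characteristics.
ATTRIBUTES = ("username", "site", "ce", "se", "filename", "exitcode")

def to_items(job):
    """Turn one job record into a frozenset of 'attribute=value' items.

    Attributes that are missing or None are skipped.
    """
    return frozenset(
        f"{attr}={job[attr]}" for attr in ATTRIBUTES if job.get(attr) is not None
    )

# Illustrative records only, not real monitoring data.
jobs = [
    {"username": "user1", "site": "SiteA", "ce": "ce1.example.org",
     "se": "se1.example.org", "filename": "job.py", "exitcode": 0},
    {"username": "user2", "site": "SiteB", "ce": "ce2.example.org",
     "se": None, "filename": "run.sh", "exitcode": 70500},
]

# One item set ("transaction") per job.
transactions = [to_items(job) for job in jobs]
```

Representing each job as a frozenset keeps subset tests cheap, which is the basic operation when counting how often a group of items occurs together.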

Step 2: data mining (1/2)

Association Rule Mining
– find frequent item sets in the database
– item: attribute-value pair (e.g. site=CERN-PROD)
– rule: {A, B} → {C}, where A, B, C are items and …
– support: what fraction of the data includes A, B and C?
– confidence: of the data that includes A and B, what fraction also includes C?
– e.g. {username=xxx, ce=cmsgrid02.hep.wisc.edu} → {exit code = 70500}

Example: CMS job monitoring data
– 2-day period
– 42667 analysis jobs
– 49 rules with the exit code in the consequent of the rule
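
The support and confidence of a rule can be made concrete with a small sketch. The four toy transactions below are hypothetical and only illustrate the arithmetic; they are not taken from the CMS data.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = frozenset(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent. Assumes the antecedent occurs at least once."""
    antecedent = frozenset(antecedent)
    both = antecedent | frozenset(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Toy data (hypothetical): four jobs described by 'attribute=value' items.
transactions = [
    frozenset({"username=xxx", "ce=ce1", "exitcode=70500"}),
    frozenset({"username=xxx", "ce=ce1", "exitcode=70500"}),
    frozenset({"username=xxx", "ce=ce1", "exitcode=0"}),
    frozenset({"username=yyy", "ce=ce2", "exitcode=0"}),
]

A = {"username=xxx", "ce=ce1"}   # antecedent
C = {"exitcode=70500"}           # consequent

rule_support = support(transactions, A | C)       # 2 of 4 jobs
rule_confidence = confidence(transactions, A, C)  # 2 of the 3 jobs matching A
```

Here the rule covers half of all jobs (support 0.5), and two out of the three jobs that match the antecedent also show the exit code (confidence 2/3).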

Step 2: data mining (2/2)

Input: job monitoring information of the Dashboard database
1. Find frequent item sets (Apriori algorithm): item set 1, item set 2, …, item set n
2. Create association rules: rule 1, rule 2, …, rule n
3. Prune the rules to eliminate redundancies (pruning algorithm): rule 1, rule 2, …, rule k
Output: set of association rules
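
The three stages can be sketched in a few lines of Python. This is an illustrative toy version under stated assumptions, not the tool's implementation: the thresholds and data are hypothetical, and the redundancy criterion in `prune` (drop a rule when a rule with the same consequent and a smaller antecedent has at least the same confidence) is an assumption, since the slides do not specify the pruning algorithm.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise Apriori search: return {itemset: support}."""
    n = len(transactions)
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent, k = {}, 1
    while candidates:
        # Keep the candidates that reach the support threshold.
        level = {}
        for c in candidates:
            s = sum(c <= t for t in transactions) / n
            if s >= min_support:
                level[c] = s
        frequent.update(level)
        # Join frequent k-sets into (k+1)-candidates and keep only those
        # whose k-subsets are all frequent (the Apriori property).
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

def exit_code_rules(frequent, min_confidence):
    """Build rules antecedent -> 'exitcode=...' from the frequent item sets."""
    rules = []
    for itemset, supp in frequent.items():
        for item in itemset:
            antecedent = itemset - {item}
            if not item.startswith("exitcode=") or not antecedent:
                continue
            conf = supp / frequent[antecedent]
            if conf >= min_confidence:
                rules.append((antecedent, item, supp, conf))
    return rules

def prune(rules):
    """Assumed criterion: a rule is redundant if a rule with the same
    consequent and a strictly smaller antecedent is at least as confident."""
    return [
        (ante, cons, supp, conf)
        for ante, cons, supp, conf in rules
        if not any(c2 == cons and a2 < ante and conf2 >= conf
                   for a2, c2, _, conf2 in rules)
    ]

# Toy data (hypothetical), one frozenset of items per job.
transactions = [
    frozenset({"username=xxx", "site=SiteA", "exitcode=70500"}),
    frozenset({"username=xxx", "site=SiteA", "exitcode=70500"}),
    frozenset({"username=xxx", "site=SiteB", "exitcode=70500"}),
    frozenset({"username=yyy", "site=SiteA", "exitcode=0"}),
]

frequent = frequent_itemsets(transactions, min_support=0.5)
rules = exit_code_rules(frequent, min_confidence=0.8)
pruned = prune(rules)
```

On this toy data two rules pass the thresholds, and pruning keeps only the shorter one, {username=xxx} → exitcode=70500, because adding site=SiteA to the antecedent does not raise the confidence.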

Step 3: output representation (1/2)

QAOES: Quick Analysis Of Error Source
– textual representation of the association rules
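
As an illustration of a textual representation, a rule could be rendered as one line of text together with its support and confidence. The layout, the function name `format_rule` and the numbers below are hypothetical; the actual QAOES output format is not reproduced in these slides.

```python
def format_rule(antecedent, consequent, support, confidence):
    """Render one association rule as a single line of text (hypothetical layout)."""
    left = ", ".join(sorted(antecedent))  # sort for a stable, readable order
    return (f"{{{left}}} -> {{{consequent}}} "
            f"(support {support:.1%}, confidence {confidence:.1%})")

# Made-up rule and values, for illustration only.
line = format_rule({"username=xxx", "site=SiteA"}, "exitcode=70500", 0.05, 0.9)
print(line)
```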

Step 3: output representation (2/2)

Graphical representation of the rules
– each line corresponds to one rule
– each point corresponds to an item
– e.g. {username=user224, site=GRIF} → {exitcode=10034}

Outlook
– adapt the statistical measure that defines a rule as interesting in the pruning step
– provide the prototype to shifters of the ATLAS distributed production system to help track errors