1 Mining Job Monitoring Data
Automatic Error Source Detection of Grid Job Failures using Data Mining Techniques
Gerhild Maier, September 24th 2008
EGEE-III INFSO-RI-222667, Enabling Grids for E-sciencE, www.eu-egee.org
EGEE and gLite are registered trademarks

2 Problem Description
• We have ...
  – a lot of information about jobs in the Dashboard database
  – exit codes
  – many tools to monitor jobs
• We don't have ...
  – a clear classification of all exit codes; application exit codes are sometimes misleading
• We want ...
  – to look at the underlying problem
  – automatic detection of the error source, i.e. the problematic Grid component
  – a generic tool for all big LHC experiments
  – a simple tool that needs little specification from the user

3 Approach
• Step 1: data preprocessing
  – How much job information?
  – How many data sets?
• Step 2: data mining
  – Supervised or unsupervised method?
  – Clustering? Classification? Decision tree? Association rules?
• Step 3: output representation
  – Where to present the output?
  – Textual or graphical representation?

4 Step 1: data preprocessing
• consider six job characteristics
  – username
  – site
  – computing element
  – storage element
  – filename
  – exit code
• good/bad classification with Support Vector Machines
• select job information over a two-day period
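The selection step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the key names (`ce`, `se`, `exitcode`, `finished`) and the sample values are hypothetical and do not come from the Dashboard schema, and the good/bad classification with Support Vector Machines is not shown.

```python
from datetime import datetime, timedelta

# The six job characteristics from the slide; the key names are assumed.
ATTRIBUTES = ("username", "site", "ce", "se", "filename", "exitcode")


def select_jobs(records, start, days=2):
    """Keep jobs finished in [start, start + days) and project them onto ATTRIBUTES."""
    end = start + timedelta(days=days)
    selected = []
    for rec in records:
        # "finished" is a hypothetical timestamp field on each job record.
        if start <= rec["finished"] < end:
            selected.append({name: rec.get(name) for name in ATTRIBUTES})
    return selected


# Made-up job records; the extra "cpu_time" field is dropped by the projection.
records = [
    {"finished": datetime(2008, 9, 1, 10, 0), "username": "user1", "site": "SiteA",
     "ce": "ce01.example.org", "se": "se01.example.org", "filename": "a.root",
     "exitcode": 0, "cpu_time": 120},
    {"finished": datetime(2008, 9, 2, 23, 30), "username": "user2", "site": "SiteB",
     "ce": "ce02.example.org", "se": "se02.example.org", "filename": "b.root",
     "exitcode": 70500, "cpu_time": 15},
    {"finished": datetime(2008, 9, 3, 0, 0), "username": "user3", "site": "SiteA",
     "ce": "ce01.example.org", "se": "se01.example.org", "filename": "c.root",
     "exitcode": 0, "cpu_time": 300},
]

selected = select_jobs(records, start=datetime(2008, 9, 1))
```

The window is half-open, so a job finishing exactly at the end of the second day falls outside the selection.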

5 Step 2: data mining (1/2)
• Association Rule Mining
  – find frequent item sets in the database
  – item: attribute-value pair (e.g. site=CERN-PROD)
  – rule: {A, B} → {C}, where A, B, C are items
  – support: how much data includes A, B and C?
  – confidence: if A, B are included, how much data also includes C?
  – e.g. {username=xxx, ce=cmsgrid02.hep.wisc.edu} → {exit code = 70500}
• Example: CMS job monitoring data
  – two-day period
  – 42667 analysis jobs
  – 49 rules with exit code in the consequent of the rule
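As a small worked illustration of support and confidence, the sketch below computes both measures for one rule. It assumes each job is stored as a set of (attribute, value) items; the four jobs, user names and computing element names are made up for the example and are not taken from the CMS data.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = frozenset(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    ante = support(transactions, antecedent)
    both = support(transactions, set(antecedent) | set(consequent))
    return both / ante if ante else 0.0


# Four hypothetical jobs, each a set of attribute-value items.
jobs = [
    frozenset({("username", "userA"), ("ce", "ce01"), ("exitcode", "70500")}),
    frozenset({("username", "userA"), ("ce", "ce01"), ("exitcode", "70500")}),
    frozenset({("username", "userA"), ("ce", "ce02"), ("exitcode", "0")}),
    frozenset({("username", "userB"), ("ce", "ce01"), ("exitcode", "0")}),
]

# Rule: {username=userA, ce=ce01} -> {exitcode=70500}
antecedent = {("username", "userA"), ("ce", "ce01")}
consequent = {("exitcode", "70500")}

rule_support = support(jobs, antecedent | consequent)      # 2 of 4 jobs
rule_confidence = confidence(jobs, antecedent, consequent)  # 2 of the 2 matching jobs
```

Here the rule covers half of the jobs (support 0.5), and every job that matches the antecedent also carries the exit code (confidence 1.0).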

6 Step 2: data mining (2/2)
[Diagram: processing flow]
Job monitoring information of the Dashboard database
→ find frequent item sets (item set 1, item set 2, …, item set n)
→ create association rules (rule 1, rule 2, …, rule n)
→ prune the rules to eliminate redundancies (rule 1, rule 2, …, rule k)
→ set of association rules
Algorithms labelled in the diagram: Apriori Algorithm, Pruning Algorithm
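The flow on this slide can be sketched as three small functions: a level-wise Apriori search for frequent item sets, rule generation restricted to an exit code in the consequent, and a pruning pass. This is a minimal sketch, not the presented implementation. The slide does not state the pruning criterion, so the one used here (drop a rule when a more general rule with the same consequent is at least as confident) is an assumption, as are the thresholds, the `exitcode` attribute name and the sample jobs. `min_support` is assumed to be greater than zero.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise Apriori search; returns {itemset: support} for all frequent item sets."""
    n = len(transactions)
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-sets into candidate k-sets.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent


def exit_code_rules(frequent, min_confidence):
    """Build rules (antecedent, consequent, support, confidence) with one exit-code item as consequent."""
    rules = []
    for itemset, supp in frequent.items():
        if len(itemset) < 2:
            continue
        for item in itemset:
            if item[0] != "exitcode":
                continue
            antecedent = itemset - {item}
            conf = supp / frequent[antecedent]  # subsets of frequent sets are frequent
            if conf >= min_confidence:
                rules.append((antecedent, item, supp, conf))
    return rules


def prune(rules):
    """Assumed criterion: drop a rule if a more general rule is at least as confident."""
    kept = []
    for ante, cons, supp, conf in rules:
        redundant = any(o_cons == cons and o_ante < ante and o_conf >= conf
                        for o_ante, o_cons, _, o_conf in rules)
        if not redundant:
            kept.append((ante, cons, supp, conf))
    return kept


# Six hypothetical jobs as sets of (attribute, value) items.
failing = frozenset({("username", "userA"), ("site", "SiteX"), ("exitcode", "70500")})
ok_x = frozenset({("username", "userB"), ("site", "SiteX"), ("exitcode", "0")})
ok_y = frozenset({("username", "userB"), ("site", "SiteY"), ("exitcode", "0")})
jobs = [failing, failing, failing, ok_x, ok_y, ok_y]

frequent = frequent_itemsets(jobs, min_support=0.3)
rules = exit_code_rules(frequent, min_confidence=0.9)
pruned = prune(rules)
```

In this toy data the rule {username=userA, site=SiteX} → {exitcode=70500} is dropped because {username=userA} → {exitcode=70500} already has the same confidence with a shorter antecedent.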

7 Step 3: output representation (1/2)
• QAOES: Quick Analysis Of Error Source
• textual representation of the association rules

8 Step 3: output representation (2/2)
• graphical representation of the rules
• each line corresponds to one rule
• each point corresponds to an item
• {username=user224, site=GRIF} → {exitcode=10034}

9 Outlook
• adapt the statistical measure that defines a rule as interesting in the pruning step
• provide the prototype to shifters of the ATLAS distributed production system to help them track errors

