Published by Steven Farmer. Modified over 8 years ago.
Mining Job Monitoring Data
Automatic Error Source Detection of Grid Job Failures using Data Mining Techniques
Gerhild Maier
September 24th, 2008
EGEE-III INFSO-RI-222667, Enabling Grids for E-sciencE, www.eu-egee.org
EGEE and gLite are registered trademarks
Problem Description
We have ...
– a lot of information about jobs in the Dashboard database
– exit codes
– many tools to monitor jobs
We don't have ...
– a clear classification of all exit codes; application exit codes are sometimes misleading
We want ...
– to look at the underlying problem
– automatic detection of the error source, i.e. the problematic Grid component
– a generic tool for all big LHC experiments
– a simple tool that needs little specification from the user
Approach
Step 1: data preprocessing
– How much job information?
– How many data sets?
Step 2: data mining
– Supervised or unsupervised method?
– Clustering? Classification? Decision tree? Association rules?
Step 3: output representation
– Where to present the output?
– Textual or graphical representation?
Step 1: data preprocessing
consider six job characteristics
– username
– site
– computing element
– storage element
– filename
– exit code
good/bad classification with Support Vector Machines
select job information over a two-day period
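The preprocessing step can be pictured as turning each job record into a set of attribute=value items, which is the input form that association rule mining works on. The following is a minimal sketch under stated assumptions: the attribute keys (`ce`, `se`, `exitcode`), the host names, and the file names are hypothetical, because the slide lists the six characteristics but not the Dashboard column names. The good/bad classification with Support Vector Machines is not shown.

```python
# Hypothetical attribute keys for the six job characteristics on the slide.
ATTRIBUTES = ("username", "site", "ce", "se", "filename", "exitcode")


def job_to_items(job):
    """Turn one job record (a dict) into a frozenset of 'attribute=value' items.

    Attributes that are missing or None are skipped.
    """
    return frozenset(
        f"{attr}={job[attr]}" for attr in ATTRIBUTES if job.get(attr) is not None
    )


# Made-up job records, for illustration only.
jobs = [
    {"username": "user224", "site": "GRIF", "ce": "ce01.example.org",
     "se": "se01.example.org", "filename": "data_001.root", "exitcode": 10034},
    {"username": "user007", "site": "CERN-PROD", "ce": "ce02.example.org",
     "se": None, "filename": "data_002.root", "exitcode": 0},
]

transactions = [job_to_items(job) for job in jobs]
```

Each element of `transactions` is then one "basket" of items, for example `site=GRIF` or `exitcode=10034`.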
Step 2: data mining (1/2)
Association Rule Mining
– find frequent item sets in the database
– item: attribute-value pair (e.g. site=CERN-PROD)
– rule: {A, B} → {C}, where A, B, C are items
– support: how much of the data includes A, B and C?
– confidence: if A and B are included, how much of the data also includes C?
– e.g. {username=xxx, ce=cmsgrid02.hep.wisc.edu} → {exit code = 70500}
Example: CMS job monitoring data
– 2-day period
– 42667 analysis jobs
– 49 rules with the exit code in the consequent of the rule
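The two measures above can be made concrete with a small worked example. This sketch uses the standard definitions (support as the fraction of jobs containing all items of the rule, confidence as the fraction of jobs matching the antecedent that also match the consequent). The four jobs are made up; only the item names of the first rule are borrowed from the slide, and `ce=other.example.org` is a hypothetical value.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = frozenset(items)
    if not transactions:
        return 0.0
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    antecedent = frozenset(antecedent)
    both = antecedent | frozenset(consequent)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    if n_antecedent == 0:
        return 0.0
    return sum(1 for t in transactions if both <= t) / n_antecedent


# Made-up data: four jobs, three of them from the same user and CE.
transactions = [
    frozenset({"username=xxx", "ce=cmsgrid02.hep.wisc.edu", "exitcode=70500"}),
    frozenset({"username=xxx", "ce=cmsgrid02.hep.wisc.edu", "exitcode=70500"}),
    frozenset({"username=xxx", "ce=cmsgrid02.hep.wisc.edu", "exitcode=0"}),
    frozenset({"username=yyy", "ce=other.example.org", "exitcode=0"}),
]

A = {"username=xxx", "ce=cmsgrid02.hep.wisc.edu"}
C = {"exitcode=70500"}

rule_support = support(transactions, A | C)      # 2 of 4 jobs
rule_confidence = confidence(transactions, A, C)  # 2 of the 3 jobs matching A
```

In this toy data the rule covers half of all jobs, and two out of three jobs from that user on that computing element end with exit code 70500.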
Step 2: data mining (2/2)
– find frequent item sets
– create association rules
– prune the rules to eliminate redundancies
[Diagram: Job Monitoring Information of the Dashboard Database → Apriori Algorithm (item set 1, item set 2, …, item set n; rule 1, rule 2, …, rule n) → Pruning Algorithm (rule 1, rule 2, …, rule k) → Set of association rules]
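The three stages above can be sketched end to end on toy data. The frequent item set search follows the usual level-wise Apriori scheme. The pruning criterion is an assumption: the slides do not say how redundancy is defined, so this sketch drops a rule when a more general rule (a proper subset of its antecedent, same consequent) has at least the same confidence. The thresholds, the `exitcode=` prefix, and the data are hypothetical.

```python
from itertools import combinations


def _keep_frequent(candidates, transactions, min_support):
    """Return {itemset: support} for candidates meeting min_support."""
    kept = {}
    for c in candidates:
        s = sum(1 for t in transactions if c <= t) / len(transactions)
        if s >= min_support:
            kept[c] = s
    return kept


def frequent_itemsets(transactions, min_support):
    """Level-wise Apriori search over item sets of growing size."""
    singles = {frozenset([i]) for t in transactions for i in t}
    level = _keep_frequent(singles, transactions, min_support)
    frequent, k = {}, 1
    while level:
        frequent.update(level)
        k += 1
        # Join step: unions of two frequent (k-1)-sets that have size k.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        level = _keep_frequent(candidates, transactions, min_support)
    return frequent


def exit_code_rules(frequent, min_confidence, prefix="exitcode="):
    """Build rules whose consequent is the exit code item."""
    rules = []
    for itemset, supp in frequent.items():
        consequents = [i for i in itemset if i.startswith(prefix)]
        if len(consequents) != 1 or len(itemset) < 2:
            continue
        antecedent = itemset - {consequents[0]}
        conf = supp / frequent[antecedent]  # subsets of frequent sets are frequent
        if conf >= min_confidence:
            rules.append((antecedent, consequents[0], supp, conf))
    return rules


def prune_redundant(rules):
    """Assumed criterion: drop a rule if a more general rule with the same
    consequent is at least as confident."""
    kept = []
    for ante, cons, supp, conf in rules:
        redundant = any(
            o_cons == cons and o_ante < ante and o_conf >= conf
            for o_ante, o_cons, _, o_conf in rules
        )
        if not redundant:
            kept.append((ante, cons, supp, conf))
    return kept


# Made-up jobs: user xxx always fails with 70500, regardless of site.
transactions = [
    frozenset({"username=xxx", "site=S1", "exitcode=70500"}),
    frozenset({"username=xxx", "site=S1", "exitcode=70500"}),
    frozenset({"username=xxx", "site=S2", "exitcode=70500"}),
    frozenset({"username=yyy", "site=S1", "exitcode=0"}),
    frozenset({"username=yyy", "site=S2", "exitcode=0"}),
]

frequent = frequent_itemsets(transactions, min_support=0.3)
rules = exit_code_rules(frequent, min_confidence=0.9)
pruned = prune_redundant(rules)
```

On this data the rule {username=xxx, site=S1} → exitcode=70500 is dropped, because {username=xxx} → exitcode=70500 already holds with the same confidence; the shorter rule points more directly at the user as the likely error source.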
Step 3: output representation (1/2)
QAOES: Quick Analysis Of Error Source
– textual representation of the association rules
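One possible way to render a rule as a line of text is sketched below. The actual QAOES output format is not given in the transcript, so the layout, the function name `format_rule`, and the support and confidence values are all hypothetical; only the item names come from the example rule in this presentation.

```python
def format_rule(antecedent, consequent, support, confidence):
    """Render one association rule as a single line of text.

    Items in the antecedent are sorted so the output is stable.
    """
    lhs = ", ".join(sorted(antecedent))
    return (
        f"{{{lhs}}} -> {{{consequent}}}  "
        f"(support {support:.1%}, confidence {confidence:.1%})"
    )


# Made-up measures, for illustration only.
line = format_rule(
    {"username=user224", "site=GRIF"}, "exitcode=10034", 0.05, 0.9
)
```

A listing of such lines, sorted for instance by confidence, would give a shifter a quick view of which user, site, or service is associated with which exit code.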
Step 3: output representation (2/2)
– graphical representation of the rules
– each line corresponds to one rule
– each point corresponds to an item
– e.g. {username=user224, site=GRIF} → {exitcode=10034}
Outlook
– adapt the statistical measure that defines a rule as interesting in the pruning step
– provide the prototype to shifters of the ATLAS distributed production system to help them track errors